I. INTRODUCTION

A. Apparent Randomness 

Natural processes appear unpredictable to varying de- 
grees and for several reasons. First, and most obviously, 
one may not know the "rules" or equations that govern 
a particular system. That is, an observer may have only 
incomplete knowledge of the forces controlling a process. 
Laplace was well aware of these sources of apparent randomness; as he commented two centuries ago in motivating his Philosophical Essay on Probabilities:

But ignorance of the different causes in- 
volved in the production of events, ... taken 
together with the imperfection of analysis, 
prevents our reaching the same certainty 
about the vast majority of phenomena. Thus 
there are things that are uncertain for us, 
things more or less probable, and we seek to 
compensate for the impossibility of knowing 
them by determining their different degrees 
of likelihood. 

Second, there may be mechanisms intrinsic to a process 
that amplify unknown or uncontrolled fluctuations to un- 
predictable macroscopic behavior. Manifestations of this 
sort of randomness include deterministic chaos and fractal separatrix structures bounding different basins of attraction. As Poincaré noted:

... it may happen that small differences 
in the initial conditions produce very great 
ones in the final phenomena. A small error 
in the former will produce an enormous error 
in the latter. Prediction becomes impossible, 
and we have the fortuitous phenomenon. 

Unpredictability of this kind also arises from sensitive dependence on parameters, such as that seen in non-structurally stable systems with continuous bifurcations, or from sensitive dependence on boundary conditions. Knowledge of the governing equations of motion does little to make these kinds of intrinsic randomness go away.

Third, and more subtly, there exists a wide array of 
observer-induced sources of apparent randomness. For 
one, the choice of representation used by the observer 
may render a system unpredictable. For example, rep- 
resenting a square wave in terms of sinusoids requires 
specifying an infinite number of amplitude coefficients. 
Truncating the order of approximation leads to errors, 
even for a source as simple and predictable as a square 
wave. Similarly, an observer's choice and design of its 
measuring instruments is an additional source of appar- 
ent randomness. As one example, Ref. Q shows how 
irreducible unpredictability arises from a measurement 
instrument's distortion of a spatio-temporal process's in- 
ternal states. 
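
The square-wave remark can be checked numerically. The following is a minimal sketch, assuming a unit-amplitude square wave of period 2π and its standard odd-harmonic sine series; the function names and the sample count are illustrative choices, not taken from the text. It shows that the truncation error shrinks as terms are added but does not vanish at any finite order.

```python
import math

def square_wave(t):
    """Ideal square wave of period 2*pi taking the values +1 and -1."""
    return 1.0 if math.sin(t) >= 0 else -1.0

def fourier_partial_sum(t, n_terms):
    """First n_terms odd harmonics: (4/pi) * sum_k sin((2k+1) t) / (2k+1)."""
    return (4 / math.pi) * sum(
        math.sin((2 * k + 1) * t) / (2 * k + 1) for k in range(n_terms)
    )

def rms_error(n_terms, n_samples=2000):
    """Root-mean-square error of the truncated series over one period."""
    total = 0.0
    for i in range(n_samples):
        # Midpoint samples avoid landing exactly on the discontinuities.
        t = 2 * math.pi * (i + 0.5) / n_samples
        total += (square_wave(t) - fourier_partial_sum(t, n_terms)) ** 2
    return math.sqrt(total / n_samples)

for n in (1, 5, 50):
    print(n, rms_error(n))
```

The representation, not the source, is responsible for the residual error here: the square wave itself is perfectly predictable.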

Fourth, the measurement process engenders apparent 
randomness in other, perhaps more obvious ways, too. 
Even if one knows the equations of motion governing 
a system, accurate prediction may not be possible: the 
measurements made by an observer may be inaccurate, 
or, if the measurements are precise, there may be an 
insufficient volume of measurement data. Or, one may 
simply not have a sufficiently long measurement stream, 
for example, to disambiguate several internal states and, 
therefore, their individual consequences for the process's 
future behavior cannot be accurately accounted for. Ex- 
amples of these sorts of measurement-induced random- 
ness are considered in Refs. In all of these cases, 
the result is that the process appears more random than 
it actually is. 

Fifth, and finally, if the dynamics are sufficiently com- 
plicated it may simply be too computationally difficult 
to perform the calculations required to go from measure- 
ments of the system to a prediction of the system's future 
behavior. The existence of deeply complicated dynamics 
for which this was a problem was first appreciated by 
Poincaré more than a century ago as part of his detailed analysis of the three-body problem.

Of course, most natural phenomena involve, to one 
degree or another, almost all of these separate sources 
of "noise". Moreover, the different mechanisms interact 
with each other. It is no surprise, therefore, that describ- 
ing and quantifying the degree of a process's apparent 
randomness is a difficult yet essential endeavor that cuts 
across many disciplines. 



B. Untangling the Mechanisms 

A central goal here is to examine ways to untangle the 
different mechanisms responsible for apparent random- 
ness by investigating several of their signatures. As one 
step in addressing these issues, we analyze those aspects 
of apparent randomness over which an observer may have 
some control. These include the choice of how to quantify 
the degree of randomness (e.g., through choices of statistic or in modeling representation) and how much data to
collect. We describe the stance taken by the observer to- 
ward the process to be analyzed in terms of the measure- 
ment channel — an adaptation of Shannon's notion 
of a communication channel. One of the central ques- 
tions addressed in the following is, how does an observer, 
apprised of a process's possible states and its dynamics, 
come to know in what internal state the process is? We 
will show that this is related to another question, How 
does an observer come to accurately estimate how ran- 
dom a source is? In particular, we shall investigate how 
finite-data approximations converge to this asymptotic 
value. We shall see a variety of different convergence be- 
haviors and will present several different quantities that 
capture the nature of this convergence. As the title of 
this work suggests, we shall see that regularities that are 
unseen are "converted" to apparent randomness. 

It is important to emphasize, and this will be clear 
through our citations, that much of our narrative about 
levels of entropy convergence touches on and restates re- 
sults and intuitions known to a number of researchers in 
information theory, dynamical systems, stochastic pro- 
cesses, and symbolic dynamics. Our attempt here, in 
light of this, is severalfold. First, we put this knowledge
into a single framework, using the language of discrete 
derivatives and integrals. We believe this approach uni- 
fies and clarifies a number of extant quantities. Second, 
and more importantly, by considering numerous exam- 
ples, we shall see that examining levels of entropy conver- 
gence can give important clues about the computational 
structure of a process. Finally, our view of entropy con- 
vergence will lead naturally to a new quantity, the tran- 
sient information T. We shall prove that the transient 
information captures the total uncertainty an observer 
must overcome in synchronizing to a Markov process. 

We begin in Sec. II by fixing notation and briefly reviewing the motivation and basic quantities of information theory. In Sections III and IV we use discrete derivatives and integrals to examine entropy convergence. In so doing, we recover a number of familiar measures of randomness, predictability, and "complexity". Then we introduce, motivate, and interpret a new information-theoretic measure of structure, the transient information. In particular, we shall see that the transient information provides a quantitative measure of the manner in which an observer synchronizes to a source. We then illustrate the utility of the quantities discussed in the preceding sections by considering a series of increasingly rich examples in Sec. VI. In Sec. VII we look at relationships between the quantities discussed previously. In particular, we show several quantitative examples of how regularities that go undetected are converted into apparent randomness. Finally, we conclude in Sec. VIII and offer thoughts on possible future directions for this line of research.



II. INFORMATION THEORY 



A. The Measurement Channel 



In the late 1940s Claude Shannon founded the field of communication theory, motivated in part by his work in cryptography during World War II [11]. His attempt
to analyze the basic trade-offs in disguising information 
from third parties in ways that still allowed recovery by 
the intended receiver led to a study of how signals could 
be compressed and transmitted efficiently and error free. 
His basic conception was that of a communication chan- 
nel consisting of an information source which produces 
messages that are encoded and passed through a possi- 
bly noisy and error-prone channel. A receiver then de- 
codes the channel's output in order to recover the original 
messages. Shannon's main assumptions were that an in- 
formation source was described by a distribution over its 
possible messages and that, in particular, a message was 
"informative" according to how surprising or unlikely its 
occurrence was. 

We adapt Shannon's conception of a communication 
channel as follows: We assume that there is a process 
(source) that produces a data stream (message) — an in- 
finite string of symbols drawn from some finite alphabet. 
The task for the observer (receiver) is to estimate the 
probability distribution of sequences and, thereby, esti- 
mate how random the process is. Further, we assume 
that the observer does not know the process's structure; 
the range of its states and their transition structure — 
the process's internal dynamics — are hidden from the 
observer. (We will, however, occasionally relax this as- 
sumption below.) Since the observer does not have direct 
access to the source's internal, hidden states, we picture 
instead that the observer can estimate to arbitrary accu- 
racy the probability of measurement sequences. Thus, we 
do not address the eminently practical issue of how much 
data is required for accurate estimation of these proba- 
bilities. For this see, for example, Refs. [0JlJ,|l^. In our 
scenario, the observer detects sequence blocks directly 
and stores their probabilities as histograms. Though ap- 
parently quite natural in this setting, one should consider 
the histogram to be a particular class of representation 
for the source's internal structure — one that may or 
may not correctly capture that structure. 

This measurement channel scenario is illustrated in Fig. 1. In this case, the source is a three-state deterministic finite automaton. However, the observer does not see the internal states {A, B, C}. Instead, it has access only to the measurement symbols {0, 1} generated on state-to-state transitions by the hidden automaton. In this sense, the measurement channel acts like a communication channel; the channel maps from an internal-state sequence ...BCBAACBC... to a measurement sequence ...0111010.... The process shown in Fig. 1 belongs to the class of stochastic processes known as hidden Markov models. The transitions from internal state to internal state are Markovian, in that the probability of a given transition depends only upon which state the process is currently in. However, these internal states are not seen by the observer; hence the name "hidden" Markov model.
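
To make the measurement channel concrete, here is a minimal simulation sketch. The transition table is hypothetical: the edges of the automaton in Fig. 1 are not recoverable from the text, so the probabilities and symbol labels below are assumptions chosen only to illustrate a three-state source that emits a binary symbol on each state-to-state transition.

```python
import random

# Hypothetical automaton (NOT the one in Fig. 1):
# state -> list of (probability, emitted symbol, next state).
TRANSITIONS = {
    "A": [(0.5, 0, "A"), (0.5, 1, "B")],
    "B": [(1.0, 1, "C")],
    "C": [(0.5, 0, "A"), (0.5, 1, "B")],
}

def run_channel(n_steps, start="A", seed=0):
    """Return (hidden state sequence, observed symbol sequence)."""
    rng = random.Random(seed)
    state = start
    states, symbols = [], []
    for _ in range(n_steps):
        r = rng.random()
        cumulative = 0.0
        # Pick the outgoing transition whose cumulative probability covers r.
        for prob, symbol, nxt in TRANSITIONS[state]:
            cumulative += prob
            if r < cumulative:
                break
        states.append(state)
        symbols.append(symbol)
        state = nxt
    return states, symbols

hidden, observed = run_channel(20)
print("hidden:  ", "".join(hidden))
print("observed:", "".join(str(s) for s in observed))
```

Only the second printed line is available to the observer; the first is the internal-state sequence that the channel hides.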

Given this situation, a number of issues arise for the observer. One fundamental question is how many of the
system's properties can be inferred from the observed bi- 
nary data stream. In particular, can the observer build a 
model of the system that allows for accurate prediction? 
According to Shannon's coding theorem, success in an- 
swering these questions depends on whether the system's 
entropy rate falls below the measurement channel capac- 
ity. If it does, then the observer can build a model of 
the system. Conversely, if the entropy rate is above the 
measurement channel's capacity, then the theorem tells 
us that the observer cannot exactly reconstruct all prop- 
erties of the system. In this case, source messages — 
sequences over internal states — cannot be decoded in 
an error-free manner. In particular, optimal prediction 
will not be possible. In the following, we assume that 
the channel capacity is larger than the entropy rate and, 
hence, that optimal prediction is — in theory, at least — 
possible. 

Similar questions of building models from data pro- 
duced by various kinds of information sources are found 
in the fields of machine learning and computational learn- 
ing theory. See the appendices in Ref. for com- 
ments on the similarities and differences with the ap- 
proach taken here. 
FIG. 1. The measurement channel (schematic with three stages labeled System, Process, and Observer): The internal states {A, B, C} of the system are reflected, only indirectly, in the observed measurement of 1's and 0's. An observer works with this impoverished data to build a model of the underlying system. After Ref. [17].



B. Stationary Stochastic Processes

The measurement streams we shall consider will be stationary stochastic processes. In this section we introduce this idea more formally, fix notation, and define a few classes of stochastic process to which we shall return when considering examples in Sec. VI.

The main object of our attention will be a one-dimensional chain

$$ S \equiv \ldots S_{-2} S_{-1} S_0 S_1 \ldots \qquad (1) $$

of random variables $S_t$ that range over a finite set $\mathcal{A}$. We assume that the underlying system is described by a shift-invariant measure $\mu$ on infinite sequences $\ldots s_{-2} s_{-1} s_0 s_1 s_2 \ldots$; $s_t \in \mathcal{A}$. The measure $\mu$ induces a family of distributions, $\{\Pr(s_{t+1}, \ldots, s_{t+L}) : s_t \in \mathcal{A}\}$, where $\Pr(s_t)$ denotes the probability that at time $t$ the random variable $S_t$ takes on the particular value $s_t \in \mathcal{A}$ and $\Pr(s_{t+1}, \ldots, s_{t+L})$ denotes the joint probability over blocks of $L$ consecutive symbols. We assume that the distribution is stationary; $\Pr(s_{t+1}, \ldots, s_{t+L}) = \Pr(s_1, \ldots, s_L)$.

We denote a block of $L$ consecutive variables by $S^L = S_1 \ldots S_L$. We shall follow the convention that a capital letter refers to a random variable, while a lowercase letter denotes a particular value of that variable. Thus, $s^L = s_0 s_1 \cdots s_{L-1}$ denotes a particular symbol block of length $L$. We shall use the term process to refer to the joint distribution $\Pr(S)$ over the infinite chain of variables. A process, defined in this way, is what Shannon referred to as an information source.

For use later on, we define several types of processes. First, and most simply, a process with a uniform distribution is one in which all sequences occur with equiprobability. We will denote this distribution by $U^L$:

$$ U(s^L) = 1 / |\mathcal{A}|^L . \qquad (2) $$

Next, a process is independently and identically distributed (IID) if the joint distribution $\Pr(S) = \Pr(\ldots, S_i, S_{i+1}, S_{i+2}, S_{i+3}, \ldots)$ factors in the following way:

$$ \Pr(S) = \cdots \Pr(S_i) \Pr(S_{i+1}) \Pr(S_{i+2}) \cdots \qquad (3) $$

and $\Pr(S_i) = \Pr(S_j)$ for all $i, j$.

We shall call a process Markovian if the probability of the next symbol depends only on the previous symbol seen. In other words, the joint distribution factors in the following way:

$$ \Pr(S) = \cdots \Pr(S_{i+1} | S_i) \Pr(S_{i+2} | S_{i+1}) \cdots \qquad (4) $$



More generally, a process is order-R Markovian if the probability of the next symbol depends only on the previous $R$ symbols:

$$ \Pr(S_i | \ldots, S_{i-2}, S_{i-1}) = \Pr(S_i | S_{i-R}, \ldots, S_{i-1}) . \qquad (5) $$

Finally, a hidden Markov process consists of an internal order-R Markov process that is observed only by a function of its internal-state sequences. These are sometimes called functions of a Markov chain. We refer to all of these processes as finitary, since there is a well defined sense, discussed below, in which they have a finite amount of memory.
C. Basic Quantities of Information Theory

Here, we briefly state the definitions and interpretations of the basic quantities of information theory. Let $X$ be a random variable that assumes the values $x \in \mathcal{X}$, where $\mathcal{X}$ is a finite set. We denote the probability that $X$ assumes the particular value $x$ by $\Pr(x)$. Likewise, let $Y$ be a random variable that assumes the values $y \in \mathcal{Y}$.

The Shannon entropy of $X$ is defined by:

$$ H[X] \equiv -\sum_{x \in \mathcal{X}} \Pr(x) \log_2 \Pr(x) . \qquad (6) $$

Note that $H[X] \geq 0$. The units of $H[X]$ are bits. The entropy $H[X]$ measures the uncertainty associated with the random variable $X$. Equivalently, it measures the average amount of memory, in bits, needed to store outcomes of the variable $X$. The conditional entropy is defined by

$$ H[X|Y] \equiv -\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \Pr(x, y) \log_2 \Pr(x|y) , \qquad (7) $$

and measures the average uncertainty associated with variable $X$, if we know $Y$.

The mutual information between $X$ and $Y$ is defined as

$$ I[X;Y] \equiv H[X] - H[X|Y] . \qquad (8) $$

In words, the mutual information is the average reduction of uncertainty of one variable due to knowledge of another. If knowing $Y$ on average makes one more certain about $X$, then it makes sense to say that $Y$ carries information about $X$. Note that $I[X;Y] \geq 0$ and that $I[X;Y] = 0$ when either $X$ and $Y$ are independent (there is no "communication" between $X$ and $Y$) or when either $H[X] = 0$ or $H[Y] = 0$ (there is no information to share). Note also that $I[X;Y] = I[Y;X]$.

The information gain between two distributions $\Pr(x)$ and $\widehat{\Pr}(x)$ is defined by:

$$ \mathcal{D}[\Pr(x) || \widehat{\Pr}(x)] \equiv \sum_{x \in \mathcal{X}} \Pr(x) \log_2 \frac{\Pr(x)}{\widehat{\Pr}(x)} , \qquad (9) $$

where $\widehat{\Pr}(x) = 0$ only if $\Pr(x) = 0$. Quantitatively, $\mathcal{D}[P||Q]$ is the number of bits by which the two distributions $P$ and $Q$ differ. Informally, $\mathcal{D}[P||Q]$ can be viewed as the distance between $P$ and $Q$ in a space of distributions. However, $\mathcal{D}[P||Q]$ is not a metric, since it does not obey the triangle inequality.

Similarly, the conditional entropy gain between two conditional distributions $\Pr(x|y)$ and $\widehat{\Pr}(x|y)$ is defined by:

$$ \mathcal{D}[\Pr(x|y) || \widehat{\Pr}(x|y)] \equiv \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \Pr(x, y) \log_2 \frac{\Pr(x|y)}{\widehat{\Pr}(x|y)} . \qquad (10) $$


D. Block Entropy and Entropy Rate

We now examine the behavior of the Shannon entropy $H(L)$ of $\Pr(s^L)$, the distribution over blocks of $L$ consecutive variables. We shall see that examining how the Shannon entropy of a block of variables grows with $L$ leads to several quantities that capture aspects of a process's randomness and different features of its memory.

The total Shannon entropy of length-L sequences is defined

$$ H(L) \equiv -\sum_{s^L \in \mathcal{A}^L} \Pr(s^L) \log_2 \Pr(s^L) , \qquad (11) $$

where $L > 0$. The sum is understood to run over all possible blocks of $L$ consecutive symbols. If no measurements are made, there is nothing about which to be uncertain and, thus, we define $H(0) \equiv 0$. Below we will show that $H(L)$ is a non-decreasing function of $L$; $H(L) \geq H(L-1)$. We shall also see that it is concave; $H(L) - 2H(L-1) + H(L-2) \leq 0$.
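
The definition in Eq. (11) and the two stated properties can be checked by brute-force enumeration. The sketch below assumes a hypothetical binary first-order Markov chain (after a 0 the next symbol is always 1; after a 1 the next symbol is 0 or 1 with equal probability); the chain and all function names are illustrative choices, not taken from the text.

```python
import itertools
import math

# Hypothetical chain: T[a][b] = Pr(next symbol = b | current symbol = a).
T = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 0.5}}
PI = {0: 1 / 3, 1: 2 / 3}  # stationary distribution of T

def block_prob(block):
    """Pr(s^L) for a stationary first-order Markov chain."""
    p = PI[block[0]]
    for a, b in zip(block, block[1:]):
        p *= T[a][b]
    return p

def block_entropy(L):
    """H(L) in bits, with H(0) = 0 by definition."""
    if L == 0:
        return 0.0
    H = 0.0
    for block in itertools.product((0, 1), repeat=L):
        p = block_prob(block)
        if p > 0:  # zero-probability blocks contribute nothing
            H -= p * math.log2(p)
    return H

H = [block_entropy(L) for L in range(8)]
for L in range(2, 8):
    gain = H[L] - H[L - 1]
    curvature = H[L] - 2 * H[L - 1] + H[L - 2]
    print(L, round(H[L], 4), round(gain, 4), round(curvature, 4))
```

For this chain the curvature is strictly negative at $L = 2$ and zero (to rounding) afterwards, which is what one expects from a source whose memory extends back only one symbol.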



FIG. 2. Total Shannon entropy growth for a finitary information source: a schematic plot of $H(L)$ versus $L$. $H(L)$ increases monotonically and asymptotes to the line $E + h_\mu L$, where $E$ is the excess entropy and $h_\mu$ is the source entropy rate. This dashed line is the E-memoryful Markovian source approximation to a source with entropy growth $H(L)$. The entropy growth of the memoryless-source approximation of the source is indicated by the short-dashed line $h_\mu L$ through the origin with slope $h_\mu$. The shaded area is the transient information $T$. For more discussion, see text.

Note that the maximum average information per observation is $\log_2 |\mathcal{A}|$, so $H(1) \leq \log_2 |\mathcal{A}|$, and, more generally,

$$ H(L) \leq L \log_2 |\mathcal{A}| . \qquad (12) $$

Equality in Eq. (12) occurs only when the distribution over L-blocks is uniform; i.e., given by $U^L$. Figure 2 shows $H(L)$ for a typical information source. The various labels and the interpretation of $H(L)$ there will be discussed fully below.

The source entropy rate $h_\mu$ is the rate of increase with respect to $L$ of the total Shannon entropy in the large-L limit:

$$ h_\mu \equiv \lim_{L \to \infty} \frac{H(L)}{L} , \qquad (13) $$

where $\mu$ denotes the measure over infinite sequences that induces the L-block joint distribution $\Pr(s^L)$; the units are bits/symbol. The limit in Eq. (13) exists for all stationary measures $\mu$ [19]. The entropy rate $h_\mu$ quantifies the irreducible randomness in sequences produced by a source: the randomness that remains after the correlations and structures in longer and longer sequence blocks are taken into account. The entropy rate is also known as the thermodynamic entropy density in statistical mechanics or the metric entropy in dynamical systems theory.

As Shannon proved in his original work, $h_\mu$ also measures the length, in bits per symbol, of the optimal, uniquely decodable, binary encoding for the measurement sequence. That is, a message of $L$ symbols requires (as $L \to \infty$) only $h_\mu L$ bits of information rather than $L \log_2 |\mathcal{A}|$ bits. This is consonant with the idea of $h_\mu$ as a measure of randomness. On the one hand, a process that is highly random, and hence has large $h_\mu$, is difficult to compress. On the other hand, a process with low $h_\mu$ has many correlations between symbols that can be exploited by an efficient coding scheme.

As noted above, the limit in Eq. (13) is guaranteed to exist for all stationary sources. In other words,

$$ H(L) \sim h_\mu L \quad \text{as } L \to \infty . \qquad (14) $$

However, knowing the value of $h_\mu$ indicates nothing about how $H(L)/L$ approaches this limit. Moreover, there may be (and indeed usually are) sublinear terms in $H(L)$. For example, one may have $H(L) \sim c + h_\mu L$ or $H(L) \sim \log L + h_\mu L$. We shall see below that the sublinear terms in $H(L)$ and the manner in which $H(L)$ converges to its asymptotic form reveal important structural information about a process.
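
As a small worked illustration (not taken from the source), consider a hypothetical period-2 source that emits ...010101... with its two phases equally likely. Every block of length $L \geq 1$ is then one of two alternating words, each with probability 1/2, which gives a constant sublinear term and a vanishing entropy rate:

```latex
\begin{align*}
H(0) &= 0 , \\
H(L) &= -2 \cdot \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1 \text{ bit} \quad (L \geq 1) , \\
h_\mu &= \lim_{L \to \infty} \frac{H(L)}{L} = 0 , \\
H(L) &= c + h_\mu L \quad \text{with } c = 1 ,\ h_\mu = 0 \quad (L \geq 1) .
\end{align*}
```

Here the entire block entropy sits in the constant term: one bit of phase information and no randomness per symbol, even though $H(L)/L = 1/L$ only approaches zero slowly.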



E. Redundancy 

Before moving on to our main task (considering what can be learned from looking at the entropy growth curve $H(L)$) we introduce one additional quantity from information theory. Since we are using an alphabet of size $|\mathcal{A}|$, if nothing else is known about the process or the channel we can consider the measurement channel used to observe the process to have a channel capacity of $C = \log_2 |\mathcal{A}|$. Said another way, the maximum observable entropy rate for the channel output (the measurement sequence) is $\log_2 |\mathcal{A}|$.

Frequently, however, the observed $h_\mu$ is less than its maximum value. This difference is measured by the redundancy $R$:

$$ R \equiv \log_2 |\mathcal{A}| - h_\mu . \qquad (15) $$

Note $R \geq 0$. If $R > 0$, then the series of random variables $\ldots, S_i, S_{i+1}, \ldots$ has some degree of regularity: either the individual variables are biased in some way or there are correlations between them. Recall that the entropy rate measures the size, in bits per symbol, of the optimal binary compression of the source. The redundancy, then, measures the amount by which a given source can be compressed. If a system is highly redundant, it can be compressed a great deal.

For another interpretation of the redundancy, one can show that $R$ is the information gain of the source's actual distribution $\Pr(s^L)$ with respect to the uniform distribution $U(s^L)$ in the $L \to \infty$ limit:

$$ R = \lim_{L \to \infty} \frac{\mathcal{D}[\Pr(s^L) || U(s^L)]}{L} , \qquad (16) $$

where $\mathcal{D}$ is defined in Eq. (9). Restated, then, the redundancy $R$ is a measure of the information gained when an observer, expecting a uniform distribution, learns the actual distribution over the sequence.
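
Equation (16) can be checked directly in the simplest case. The sketch below assumes a hypothetical IID biased-coin source (probability 0.8 of emitting a 1), for which the entropy rate equals the single-symbol entropy; the bias value and function names are illustrative assumptions. For an IID source the per-symbol information gain relative to the uniform distribution equals $R$ at every $L$, not just in the limit.

```python
import itertools
import math

P_ONE = 0.8  # hypothetical bias: probability of emitting symbol 1

def block_prob(block):
    """Probability of a block under the IID biased-coin source."""
    prob = 1.0
    for symbol in block:
        prob *= P_ONE if symbol == 1 else 1.0 - P_ONE
    return prob

def info_gain_vs_uniform(L, alphabet=(0, 1)):
    """D[Pr(s^L) || U(s^L)] in bits, by enumerating all L-blocks."""
    uniform = len(alphabet) ** (-L)
    gain = 0.0
    for block in itertools.product(alphabet, repeat=L):
        p = block_prob(block)
        if p > 0:
            gain += p * math.log2(p / uniform)
    return gain

# For an IID source the entropy rate is the single-symbol entropy.
h_mu = -(P_ONE * math.log2(P_ONE) + (1 - P_ONE) * math.log2(1 - P_ONE))
redundancy = math.log2(2) - h_mu  # Eq. (15) with |A| = 2

for L in (1, 2, 4, 8):
    print(L, info_gain_vs_uniform(L) / L, redundancy)
```

For sources with memory the ratio would depend on $L$ and only converge to $R$ asymptotically; the IID case is used here because both sides can be computed exactly.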



III. LEVELS OF ENTROPY CONVERGENCE: 
DERIVATIVES OF H(L)

With these preliminaries out of the way, we are now ready to begin the main task: examining the growth of the entropy curve $H(L)$. In particular, we shall look carefully at the manner in which the block entropy $H(L)$ converges to its asymptotic form, an issue that has occupied the attention of many researchers. In what follows, we present a systematic method for examining entropy convergence. To do so, we will take discrete derivatives of $H(L)$ and also form various integrals of these derivatives. This method allows one to recover a number of quantities that were introduced some years ago and that can be interpreted as different aspects of a system's memory or structure. Additionally, our discrete derivative framework will lead us to define a new quantity, the transient information, which may be interpreted as a measure of how difficult it is to synchronize to a source, in a sense to be made precise below.

Before continuing, we pause to note that the representation of a finitary process shown in the entropy growth curve of Fig. 2 is phenomenological, in the sense that $H(L)$ and the other quantities indicated derive only from the observed distribution $\Pr(s^L)$ over sequences. In particular, they do not require any additional or prior knowledge of the source and its internal structure.
A. Discrete Derivatives and Integrals

We begin by briefly collecting some elementary properties of discrete derivatives. Consider an arbitrary function $F : \mathbb{Z} \to \mathbb{R}$. In what follows, the function $F$ will be the Shannon block entropy $H(L)$, but for now we consider general functions. The discrete derivative is the linear operator defined by:

$$ (\Delta F)(L) \equiv F(L) - F(L-1) . \qquad (17) $$

The picture is that the operator $\Delta$ acts on $F$ to produce a new function $\Delta F$ which, when evaluated at $L$, yields $F(L) - F(L-1)$. Higher-order derivatives are defined by composition:

$$ \Delta^n F \equiv (\Delta \circ \Delta^{n-1}) F , \qquad (18) $$

where $\Delta^0 F \equiv F$ and $n \geq 1$. For example, the second discrete derivative is given by:

$$ \Delta^2 F(L) \equiv (\Delta \circ \Delta) F(L) \qquad (19) $$
$$ = F(L) - 2F(L-1) + F(L-2) . \qquad (20) $$

One "integrates" a discrete function $\Delta F(L)$ by summing:

$$ \sum_{L=A}^{B} \Delta F(L) = F(B) - F(A-1) . \qquad (21) $$

An integration-by-parts formula also holds:

$$ \sum_{L=A}^{B} L \, \Delta F(L) = B F(B) - A F(A-1) - \sum_{L=A}^{B-1} F(L) . \qquad (22) $$

Note the shift in the sum's limits on the right-hand side.
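
The identities in Eqs. (20) through (22) are easy to confirm numerically. The following is a minimal sketch using an arbitrary integer-valued test function (an assumption made so that all comparisons are exact); the helper name `delta` and the limits `A`, `B` are illustrative.

```python
def delta(F):
    """Discrete derivative: (delta F)(L) = F(L) - F(L - 1), as in Eq. (17)."""
    return lambda L: F(L) - F(L - 1)

def F(L):
    """Arbitrary integer-valued test function (not from the text)."""
    return L ** 3 - 2 * L

dF = delta(F)
d2F = delta(dF)  # second derivative by composition, Eq. (18)
A, B = 2, 9

# Eq. (21): summing the derivative telescopes.
sum_dF = sum(dF(L) for L in range(A, B + 1))
telescoped = F(B) - F(A - 1)

# Eq. (22): integration by parts; the last sum stops at B - 1.
sum_L_dF = sum(L * dF(L) for L in range(A, B + 1))
by_parts = B * F(B) - A * F(A - 1) - sum(F(L) for L in range(A, B))

print(sum_dF == telescoped, sum_L_dF == by_parts)
print(all(d2F(L) == F(L) - 2 * F(L - 1) + F(L - 2) for L in range(-3, 10)))
```

Working with integers avoids floating-point tolerance questions, so any mismatch would point to an error in the limits of the sums rather than to rounding.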



B. ΔH(L): Entropy Gain

We now consider the effects of applying the discrete derivative operator $\Delta$ to the entropy growth curve $H(L)$. We begin with the first derivative of $H(L)$:

$$ \Delta H(L) \equiv H(L) - H(L-1) , \qquad (23) $$

where $L > 0$. The units of $\Delta H(L)$ are bits/symbol. A plot of a typical $\Delta H(L)$ vs. $L$ is shown in Fig. 3. We refer to $\Delta H(L)$ as the entropy gain for obvious reasons.

If a measurement has not yet been made, the apparent entropy rate is maximal. Thus, we define $\Delta H(0) \equiv \log_2 |\mathcal{A}|$. In a Bayesian modeling setting this is equivalent to being told only that the source has $|\mathcal{A}|$ symbols and then assuming the process is independent, identically distributed, and uniformly distributed over individual symbols.

Having made a single measurement in each experiment in an ensemble or, equivalently, only looking at single-symbol statistics in one experiment, the entropy gain is the single-symbol Shannon entropy: $\Delta H(1) = H(1) - H(0) = H(1)$, since we defined $H(0) \equiv 0$.



FIG. 3. Entropy-rate convergence: A schematic plot of $h_\mu(L) = \Delta H(L)$ versus $L$ using the finitary process's $H(L)$ shown in Fig. 2. The vertical axis is $\Delta H$, with marks at $\log_2 |\mathcal{A}|$ and $H(1)$. The entropy rate asymptote $h_\mu$ is indicated by the lower horizontal dashed line. The shaded area is the excess entropy $E$.

Let's now look at some properties of $\Delta H(L)$.

Proposition 1 $\Delta H(L)$ is an information gain:

$$ \Delta H(L) = \mathcal{D}[\Pr(s^L) || \Pr(s^{L-1})] , \qquad (24) $$

where $L > 1$.

Proof. Since many of the proofs are straightforward, direct calculations, we have put most of them in Appendix A so as not to interrupt the flow of ideas in the main sections. Proposition 1 is proved in App. A 1. □

Note that Eq. (24) is a slightly different form for the information gain than that defined in Eq. (9). Unlike Eq. (9), in Eq. (24) the two distributions do not have the same support: one, $\{s^L\}$, is a refinement of the other, $\{s^{L-1}\}$. When this is the case, we extend the length $L-1$ distribution to a distribution over length $L$ sequences by concatenating the symbols $s_{L-1}$ with equal probability onto $s_0, \ldots, s_{L-2}$. We then sum the terms in $\mathcal{D}$ over the set of length $L$ sequences.

Note that since the information gain is a non-negative quantity [19], it follows from Prop. 1 that $\Delta H(L) = H(L) - H(L-1) \geq 0$, as remarked earlier. In a subsequent section, we shall see that $\Delta^2 H(L) \leq 0$; hence, $\Delta H(L)$ is monotone decreasing.

The derivative $\Delta H(L)$ may also be written as a conditional entropy. Since

$$ \frac{\Pr(s^L)}{\Pr(s^{L-1})} = \Pr(s_L | s^{L-1}) , \qquad (25) $$

it immediately follows from Eq. (23) that

$$ \Delta H(L) = H[S_L | S^{L-1}] . \qquad (26) $$

This observation helps strengthen our interpretation of $h_\mu$. Recall that the entropy rate was defined in Eq. (13) as $\lim_{L \to \infty} H(L)/L$. As is well known, the entropy rate may also be written as:

$$ h_\mu = \lim_{L \to \infty} H[S_L | S^{L-1}] . \qquad (27) $$

That is, $h_\mu$ is the average uncertainty of the variable $S_L$, given that an arbitrarily large number of preceding symbols have been seen.

By virtue of Eq. (26), we see that

$$ h_\mu = \lim_{L \to \infty} \Delta H(L) . \qquad (28) $$

Following earlier work, we denote $\Delta H(L)$ by $h_\mu(L)$:

$$ h_\mu(L) \equiv \Delta H(L) \qquad (29) $$
$$ = H(L) - H(L-1) , \quad L \geq 1 . $$

The function $h_\mu(L)$ is the estimate of how random the source appears if only blocks of variables up to length $L$ are considered. Thus, $h_\mu(L)$ may be thought of as a finite-L approximation to the entropy rate $h_\mu$: the apparent entropy rate at length $L$. Alternatively, the entropy rate $h_\mu$ can be estimated for finite $L$ by appealing to its original definition, i.e., Eq. (13). We thus define another finite-L entropy rate estimate:

$$ h'_\mu(L) \equiv \frac{H(L)}{L} , \quad L \geq 1 , \qquad (30) $$

where we also take $h'_\mu(0) \equiv \log_2 |\mathcal{A}|$. Note that while we have

$$ \lim_{L \to \infty} h'_\mu(L) = \lim_{L \to \infty} h_\mu(L) , \qquad (31) $$

in general, it is the case that

$$ h'_\mu(L) \neq h_\mu(L) , \quad L < \infty . \qquad (32) $$

Moreover, $h'_\mu(L)$ converges more slowly than $h_\mu(L)$. (The examples later illustrate the slow convergence.)

Lemma 1

$$ h'_\mu(L) \geq h_\mu(L) \geq h_\mu . \qquad (33) $$

Proof. See App. A8. □


C. Entropy Gain and Redundancy

The entropy gain can also be interpreted as a type of redundancy. To see this, first recall that the redundancy, Eq. (15), is the difference between $\log_2 |\mathcal{A}|$ and $h_\mu$, where $\log_2 |\mathcal{A}|$ is the entropy given no knowledge of the source apart from the alphabet size, and $h_\mu$ is the entropy of the source given knowledge of the distribution of arbitrarily large L-blocks. But what is the redundancy if the observer already knows the actual distribution $\Pr(s^L)$ of words up to length $L$?

This question is answered by the L-redundancy:

$$ R(L) \equiv H(L) - h_\mu L . \qquad (34) $$

Here, $H(L)$ is the entropy given that $\Pr(s^L)$ is known, and the product $h_\mu L$ is the entropy of an L-block if one uses only the asymptotic form of $H(L)$ given in Eq. (14). Note that R(L) < R, where R is defined in Eq. (15).

We now define the per-symbol L-redundancy:

$$ r(L) \equiv \Delta R(L) = h_\mu(L) - h_\mu . \qquad (35) $$

The quantity $r(L)$ gives the difference between the per-symbol entropy conditioned on $L$ measurements and the per-symbol entropy conditioned on an infinite number of measurements. In other words, $r(L)$ measures the extent to which the length-L entropy rate estimate exceeds the actual per-symbol entropy. Any difference indicates that there is redundant information in the L-blocks in the amount of $r(L)$ bits. Ebeling refers to $r(L)$ as the local (i.e., L-dependent) predictability.
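
The two finite-L estimates $h_\mu(L)$ and $h'_\mu(L)$, the ordering in Lemma 1, and the per-symbol L-redundancy $r(L)$ can be illustrated numerically. The sketch below assumes a hypothetical two-state hidden Markov source (state A emits 0 or 1 with equal probability, moving to B after a 1; state B always emits 1 and returns to A). The source, its stationary distribution, and the closed-form entropy rate of 2/3 bit per symbol (valid because each state's next state is determined by the emitted symbol) are assumptions made for illustration, not results quoted from the text.

```python
import itertools
import math

# Hypothetical source: state -> list of (probability, symbol, next state).
EMIT = {
    "A": [(0.5, 0, "A"), (0.5, 1, "B")],
    "B": [(1.0, 1, "A")],
}
PI = {"A": 2 / 3, "B": 1 / 3}  # stationary distribution over hidden states

def block_prob(block):
    """Pr(s^L): propagate hidden-state weights one observed symbol at a time."""
    weights = dict(PI)
    for symbol in block:
        new_weights = {state: 0.0 for state in EMIT}
        for state, w in weights.items():
            for prob, emitted, nxt in EMIT[state]:
                if emitted == symbol:
                    new_weights[nxt] += w * prob
        weights = new_weights
    return sum(weights.values())

def H(L):
    """Block entropy H(L) in bits; H(0) evaluates to 0."""
    total = 0.0
    for block in itertools.product((0, 1), repeat=L):
        p = block_prob(block)
        if p > 0:
            total -= p * math.log2(p)
    return total

def h_gain(L):
    """h_mu(L) = H(L) - H(L - 1), Eq. (29)."""
    return H(L) - H(L - 1)

def h_ratio(L):
    """h'_mu(L) = H(L) / L, Eq. (30)."""
    return H(L) / L

# State A makes one fair binary choice; state B makes none.
H_MU = PI["A"] * 1.0

def r(L):
    """Per-symbol L-redundancy, Eq. (35)."""
    return h_gain(L) - H_MU

for L in range(1, 9):
    print(L, round(h_ratio(L), 4), round(h_gain(L), 4), round(r(L), 4))
```

In the printed table the middle column approaches 2/3 noticeably faster than the first, which is the slower convergence of $h'_\mu(L)$ noted above.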



D. $\Delta^2 H(L)$: Predictability Gain

If we interpret $h_\mu(L)$ as an estimate of the source's
unpredictability and recall that it decreases monotonically to
$h_\mu$, we can look at $\Delta^2 H(L)$, the rate of change of
$h_\mu(L)$, as the rate at which unpredictability is lost.
Equivalently, we can view $-\Delta^2 H(L)$ as the improvement in our
predictions in going from $(L-1)$-blocks to $L$-blocks. This is the
change in the entropy rate estimate $h_\mu(L)$ and is given by the
predictability gain:

$$\Delta^2 H(L) \equiv \Delta h_\mu(L) = h_\mu(L) - h_\mu(L-1) , \tag{36}$$

where $L > 0$; the units of $\Delta^2 H(L)$ are bits/symbol$^2$. (See
Fig. 4.) Since we defined $h_\mu(0) = \log_2 |\mathcal{A}|$, we have
that

$$\Delta^2 H(1) = H(1) - \log_2 |\mathcal{A}| . \tag{37}$$

The quantity $\Delta^2 H(0)$ is not defined.

A large value of $|\Delta^2 H(L)|$ indicates that going from
statistics over $(L-1)$-blocks to $L$-blocks reduces the uncertainty
by a large amount. Speaking loosely, we shall see in Sec. VI that a
large value of $|\Delta^2 H(L)|$ suggests that the $L^{\rm th}$
measurement is particularly informative.
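The predictability gain of Eq. (36) can be evaluated directly for small examples. The sketch below does this for a periodic binary source, using the boundary value $h_\mu(0) = \log_2 |\mathcal{A}|$ from the text; the helper names and the choice of periodic templates are assumptions of the sketch.

```python
from collections import Counter
from math import log2

def block_entropy(template: str, L: int) -> float:
    """H(L) in bits for the periodic process repeating `template`,
    with all phases equally likely."""
    p = len(template)
    words = Counter(
        "".join(template[(i + k) % p] for k in range(L)) for i in range(p)
    )
    return sum((c / p) * log2(p / c) for c in words.values())

def entropy_rate_estimate(template: str, L: int, alphabet_size: int = 2) -> float:
    """h_mu(L) = H(L) - H(L-1), with the boundary value h_mu(0) = log2|A|."""
    if L == 0:
        return log2(alphabet_size)
    return block_entropy(template, L) - block_entropy(template, L - 1)

def predictability_gain(template: str, L: int, alphabet_size: int = 2) -> float:
    """Delta^2 H(L) = h_mu(L) - h_mu(L-1), Eq. (36); defined for L >= 1."""
    if L < 1:
        raise ValueError("the predictability gain is not defined for L < 1")
    return (entropy_rate_estimate(template, L, alphabet_size)
            - entropy_rate_estimate(template, L - 1, alphabet_size))

if __name__ == "__main__":
    for L in range(1, 7):
        print(L, round(predictability_gain("11000", L), 4))
```

The values are non-positive at every $L$, and the $L = 1$ value reproduces Eq. (37).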
Proposition 2 $\Delta^2 H(L)$ is a conditional information gain:

$$\Delta^2 H(L) = -\mathcal{D}\left[ \Pr(s_{L-1}|s^{L-1}) \,\|\, \Pr(s_{L-2}|s^{L-2}) \right] , \tag{38}$$

for $L \geq 3$.

Proof: See App. □

FIG. 4. Predictability gain: A schematic plot of $\Delta^2 H(L)$
versus $L$ using the "typical" $h_\mu(L)$ shown in Fig. 3. The
shaded area is the total predictability $\mathbf{G}$.

Since the information gain is non-negative, it follows from Prop. 2
that $\Delta^2 H(L) \leq 0$ and so $H(L)$ is a concave function of
$L$.

The observation contained in Prop. 2 first appeared in
Refs. [H and Q. There, $-\Delta^2 H(L)$ is referred to as the
correlation information. However, we feel that the term
"predictability gain" is a more accurate name for this quantity. The
quantity $-\Delta^2 H(L)$ measures the reduction in per-symbol
uncertainty in going from $(L-1)$-block to $L$-block statistics.
While $-\Delta^2 H(L)$ is related to the correlation between symbols
$L$ time steps apart, it does not directly measure their correlation.
The information-theoretic analog of the two-variable correlation
function is the mutual information between symbols $L$ steps apart,
$I[S_t; S_{t+L}]$, averaged over $t$. For a discussion of two-symbol
mutual information and how it compares with correlation functions,
see Refs. MM and E5|.



E. Entropy-Derivative Limits 

Ultimately, we are interested in how $H(L)$ and its derivatives
converge to their asymptotic values. As we will now show, this
question is well posed because the derivatives of $H(L)$ have
well-defined limiting behavior. First, as mentioned above, for
stationary sources, $\lim_{L \to \infty} \Delta H(L) = h_\mu$. An
immediate consequence of this is the following.

Lemma 2 For stationary processes, the higher derivatives of $H(L)$
vanish in the $L \to \infty$ limit:

$$\lim_{L \to \infty} \Delta^n H(L) = 0 , \quad n \geq 2 . \tag{39}$$

Proof. To see this, first recall that the limit
$h_\mu = \lim_{L \to \infty} \Delta H(L)$ exists for a stationary
source [19] and so the sequence $\Delta H(0), \Delta H(1),
\Delta H(2), \ldots$ converges. It follows from this that
$\lim_{L \to \infty} [\Delta H(L) - \Delta H(L-1)] =
\lim_{L \to \infty} \Delta^2 H(L) = 0$. This proves the $n = 2$ case
of Eq. (39). The $n \geq 3$ cases of Eq. (39) then follow via
identical arguments. □

To recapitulate, for the finitary processes we are considering, in
the $L \to \infty$ limit we have that

$$H(L) \sim h_\mu L , \tag{40}$$

plus possible sublinear terms. We also have that

$$\lim_{L \to \infty} \Delta H(L) = h_\mu \tag{41}$$

and

$$\lim_{L \to \infty} \Delta^n H(L) = 0 , \quad \text{for } n \geq 2 . \tag{42}$$



IV. ENTROPY CONVERGENCE INTEGRALS 



Since limits at each level of the entropy-derivative hierarchy exist,
we can ask how the derivatives converge to their limits by
investigating the following "integrals":

$$\mathcal{I}_n \equiv \sum_{L=L_n}^{\infty} \left[ \Delta^n H(L) - \lim_{L \to \infty} \Delta^n H(L) \right] . \tag{43}$$

The lower limit $L_n$ is taken to be the first value of $L$ at which
$\Delta^n H(L)$ is defined. The picture here is that at each $L$,
$\Delta^n H(L)$ over- or under-estimates the asymptotic value
$\lim_{L \to \infty} \Delta^n H(L)$ by an amount
$\Delta^n H(L) - \lim_{L \to \infty} \Delta^n H(L)$. Summing up all
of these estimates provides a measure, perhaps somewhat coarse, of
the manner in which an entropy derivative converges to its asymptotic
value. The larger the sum, the slower the convergence. Slow
convergence, in turn, indicates correlations within longer sequences,
suggesting that a process possesses a greater degree of structure:
internal structure that is responsible for maintaining the
correlations.



A. Predictability 

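We first examine $\mathcal{I}_2$. Recall that
$\lim_{L \to \infty} \Delta^2 H(L) = 0$ and that $\Delta^2 H(L)$ is
defined for $L \geq 1$. For reasons that will become clear shortly,
we refer to $\mathcal{I}_2$ as the total predictability $\mathbf{G}$.
It is defined as:

$$\mathbf{G} \equiv \mathcal{I}_2 = \sum_{L=1}^{\infty} \Delta^2 H(L) . \tag{44}$$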
Geometrically, $\mathbf{G}$ is the area under the $\Delta^2 H(L)$
curve, as shown in Fig. 4. The units of $\mathbf{G}$ are bits/symbol,
as may be inferred geometrically from Fig. 4, where the units of the
horizontal axis are symbols and those of the vertical axis are
bits/symbol$^2$. Alternatively, this observation follows directly
from Eq. (44), when one takes into account the implied
$\Delta L$ $(= 1)$ in the sum. An interpretation of $\mathbf{G}$ is
established by the following result.



Proposition 3 The magnitude of the total predictability is equal to
the redundancy, Eq. (15):

$$\mathbf{G} = -R . \tag{45}$$

Proof: See App. |AJ. □

This establishes an accounting of the maximum possible information
$\log_2 |\mathcal{A}|$ available from the measurement channel in
terms of intrinsic randomness $h_\mu$ and total predictability
$\mathbf{G}$:

$$\log_2 |\mathcal{A}| = |\mathbf{G}| + h_\mu . \tag{46}$$

That is, the raw information $\log_2 |\mathcal{A}|$ obtained when
making a single-symbol measurement can be considered to consist of
two kinds of information: that due to randomness $h_\mu$, on the one
hand, and that due to order or redundancy in the process
$\mathbf{G}$, on the other hand.

Alternatively, we see that $\mathbf{G} = \log_2 |\mathcal{A}| -
h_\mu$. Thus, viewing $h_\mu$ as measuring the unpredictable
component of a process, and recalling that $\log_2 |\mathcal{A}|$ is
the maximum possible entropy per symbol, it follows that $\mathbf{G}$
measures the source's predictable component. For this reason we refer
to $\mathbf{G}$ as the total predictability. Note that this result
turns on defining the appropriate boundary condition as
$h_\mu(0) = \log_2 |\mathcal{A}|$.

There is another form for $\mathbf{G}$ that provides an additional
interpretation. The total predictability can be expressed as an
average number of measurement symbols, or average length, where the
average is weighted by the third derivative, $\Delta^3 H(L)$.

Proposition 4 The total predictability can be expressed as a type of
average length, where the average is weighted by the third
derivative, $\Delta^3 H(L)$:

$$\mathbf{G} = -\sum_{L=2}^{\infty} (L-1) \Delta^3 H(L) , \tag{47}$$

when the sum is finite.

Proof: See App. |AJ. □

Eq. (47) shows that if $\Delta^3 H(L)$ is slow to converge to 0, then
$\mathbf{G}$ will be large. Ignoring dimensional considerations,
$\mathbf{G}$ can be viewed as an average length, since Eq. (47)
expresses $\mathbf{G}$ as $L$ averaged by $\Delta^3 H(L)$. (Note,
however, that $\mathbf{G}$ is not a correlation length; a correlation
length is typically defined as the $L$ at which a correlation
function has decayed to $1/e$ of its maximum.) Alternatively,
$\mathbf{G}$ can be viewed as an average of $\Delta^3 H(L)$, weighted
by $L$.
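The two expressions for the total predictability, Eqs. (44) and (47), can be compared numerically. The following sketch does so for a periodic binary source, truncating both sums at a block length beyond which all derivatives vanish. The helper names, the truncation at twice the period, and the use of a periodic template are assumptions of this sketch.

```python
from collections import Counter
from math import log2

def block_entropy(template: str, L: int) -> float:
    """H(L) in bits for the periodic process repeating `template`."""
    p = len(template)
    words = Counter(
        "".join(template[(i + k) % p] for k in range(L)) for i in range(p)
    )
    return sum((c / p) * log2(p / c) for c in words.values())

def total_predictability(template: str, alphabet_size: int = 2):
    """Return G computed from Eq. (44) and from Eq. (47)."""
    L_max = 2 * len(template)          # derivatives vanish well before this
    H = [block_entropy(template, L) for L in range(L_max + 1)]
    # h[L] = h_mu(L), with the boundary value h_mu(0) = log2|A|.
    h = [log2(alphabet_size)] + [H[L] - H[L - 1] for L in range(1, L_max + 1)]
    d2 = {L: h[L] - h[L - 1] for L in range(1, L_max + 1)}    # Delta^2 H(L)
    d3 = {L: d2[L] - d2[L - 1] for L in range(2, L_max + 1)}  # Delta^3 H(L)
    G_sum = sum(d2.values())                                  # Eq. (44)
    G_weighted = -sum((L - 1) * d3[L] for L in d3)            # Eq. (47)
    return G_sum, G_weighted

if __name__ == "__main__":
    print(total_predictability("10000"))
```

For a binary periodic source both values equal $-\log_2 |\mathcal{A}| = -1$, in agreement with Eq. (45) when $h_\mu = 0$.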

Speaking informally, $\mathbf{G}$ could be viewed as a measure of
"disequilibrium", since it measures the difference between the actual
entropy rate $h_\mu$ and the maximum possible entropy rate
$\log_2 |\mathcal{A}|$. The extent to which $h_\mu$ falls below the
maximum measures the deviation from uniform probability, which some
authors have interpreted as an equilibrium condition. In this vein,
several authors have proposed complexity measures based on
multiplying $\mathbf{G}$ by $h_\mu$ 47|. However, we and others have
shown that this type of complexity measure fails to capture structure
or memory, since it is only a function of disorder $h_\mu$ p8|-^0| .
For additional critiques of this type of complexity measure, see
Refs. pl| , |5l| .

Finally, note that for any periodic process, $\mathbf{G} =
\log_2 |\mathcal{A}|$, since $h_\mu = 0$. The total predictability
assumes its maximum value for a completely predictable process.
However, $\mathbf{G}$ does not tell us how difficult it is to carry
out this prediction, nor how many symbols must be observed before the
process can be optimally predicted. To capture these properties of
the system, we need to look at other entropy convergence integrals.



B. Excess Entropy 



Having looked at how $\Delta^2 H(L)$ converges to 0, we now ask: How
does $\Delta H(L) = h_\mu(L)$ converge to $h_\mu$? One answer to this
question is provided by $\mathcal{I}_1$. For reasons that will be
discussed below, we refer to $\mathcal{I}_1$ as the excess entropy
$\mathbf{E}$:

$$\mathbf{E} \equiv \mathcal{I}_1 = \sum_{L=1}^{\infty} \left[ h_\mu(L) - h_\mu \right] . \tag{48}$$

The units of $\mathbf{E}$ are bits. We may view $\mathbf{E}$
graphically as the area indicated in the entropy-rate convergence
plot of Fig. 3. For now, let us assume that the above sum is finite.
For many cases of interest, however, this assumption turns out not to
be correct, a point to which we shall return at the end of this
section.

The excess entropy has a number of different interpretations, which
will be discussed below. Excess entropy also goes by a variety of
different names. References |21,^,Q use the term "excess entropy".
Reference [12] uses "stored information" and Refs. Qlljal, 4^, 4l|
use "effective measure complexity". References 52l refer to the
excess entropy simply as "complexity". References IslJ^ refer to the
excess entropy as "predictive information". In Refs. |E6|,^, the
excess entropy is called the "reduced Rényi entropy of order 1".
1. E as predictability-gain-weighted length

Proposition 5 The excess entropy may also be written as:

$$\mathbf{E} = -\sum_{L=2}^{\infty} (L-1) \Delta^2 H(L) . \tag{49}$$

Proof: See App. |AJ. □

Eq. (49) shows that $\mathbf{E}$ may also be viewed as an average
$L$, weighted by the predictability gain $\Delta^2 H(L)$, a view
emphasized in Ref. [Q. However, this is not a dimensionally
consistent interpretation, since $\mathbf{E}$ has units of bits.
Alternatively, Eq. (49) shows that the excess entropy can be seen as
an average of $\Delta^2 H(L)$, weighted by the block length $L$.


2. E as intrinsic redundancy

The length-$L$ approximation $h_\mu(L)$ typically overestimates the
entropy rate $h_\mu$ at finite $L$. Specifically, $h_\mu(L)$
overestimates the latter by an amount $h_\mu(L) - h_\mu$ that
measures how much more random single measurements appear knowing the
finite $L$-block statistics than knowing the statistics of infinite
sequences. In other words, this excess randomness tells us how much
additional information must be gained about the sequences in order to
reveal the actual per-symbol uncertainty $h_\mu$. This merely
restates the fact that the difference $h_\mu(L) - h_\mu$ is the
per-symbol redundancy $r(L)$, defined originally in Eq. (35). Though
the source appears more random at length $L$ by the amount $r(L)$,
this amount is also the information-carrying capacity in the
$L$-blocks that is not actually random, but is due instead to
correlations. We conclude that entropy-rate convergence is controlled
by this redundancy in the source. Presumably, this redundancy is
related to structures and memory intrinsic to the process. However,
specifying how this memory is organized cannot be done within the
framework of information theory; a more structural approach based on
the theory of computation must be used. We return to the latter in
the conclusion.

There are many ways in which the finite-$L$ approximations
$h_\mu(L)$ can converge to their asymptotic value $h_\mu$. (Recall
Fig. 3.) Fixing the values of $H(1)$ and $h_\mu$, for example, does
not determine the form of the $h_\mu(L)$ curve. At each $L$ we obtain
additional information about how $h_\mu(L)$ converges, information
not contained in the values of $H(L)$ and $h_\mu(L)$ at smaller $L$.
Thus, roughly speaking, each $h_\mu(L)$ is an independent indicator
of the manner by which $h_\mu(L)$ converges to $h_\mu$.

Since each increment $h_\mu(L) - h_\mu$ is an independent
contribution in the sense just described, one sums up the individual
per-symbol $L$-redundancies to obtain the total amount of apparent
memory in a source [|l|,||j2l]j2|,|2|,|^,||,||. Calling this sum
intrinsic redundancy, we have the following result.

Proposition 6 The excess entropy is the intrinsic redundancy of the
source:

$$\mathbf{E} = \sum_{L=1}^{\infty} r(L) . \tag{50}$$

Proof. This follows directly from inserting the definition of
intrinsic redundancy, Eq. (35), in Eq. (48). □

The next proposition establishes a geometric interpretation of
$\mathbf{E}$ and an asymptotic form for $H(L)$.

Proposition 7 The excess entropy is the subextensive part of $H(L)$:

$$\mathbf{E} = \lim_{L \to \infty} \left[ H(L) - h_\mu L \right] . \tag{51}$$

Proof: See App. A6. □

This proposition implies the following asymptotic form for $H(L)$:

$$H(L) \sim \mathbf{E} + h_\mu L , \quad \text{as } L \to \infty . \tag{52}$$

Thus, we see that $\mathbf{E}$ is the $L = 0$ intercept of the linear
function Eq. (52) to which $H(L)$ asymptotes. This observation, also
made in Refs. [pp, p^ , [T3| , p7t , is shown graphically in Fig. 2.
Note that $\mathbf{E} \geq 0$, since $H(L) \geq h_\mu L$. Note also
that if $h_\mu = 0$, then $\mathbf{E} = \lim_{L \to \infty} H(L)$.

A useful consequence of Prop. 7 is that it leads one to use Eq. (52)
instead of the original (very simple) scaling of Eq. (14). Later
sections address how ignoring Eq. (52) leads to erroneous conclusions
about a process's unpredictability and structure.
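The three expressions for the excess entropy, Eqs. (48), (49), and (51), can be checked against one another on a small example. The sketch below evaluates each of them for a periodic binary source, where $h_\mu = 0$; the function names, the truncation length, and the choice of the period-16 template are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def block_entropy(template: str, L: int) -> float:
    """H(L) in bits for the periodic process repeating `template`."""
    p = len(template)
    words = Counter(
        "".join(template[(i + k) % p] for k in range(L)) for i in range(p)
    )
    return sum((c / p) * log2(p / c) for c in words.values())

def excess_entropy_forms(template: str, h_mu: float = 0.0):
    """Evaluate E via Eq. (48), Eq. (49), and Eq. (51), truncated at 2 periods."""
    L_max = 2 * len(template)
    H = [block_entropy(template, L) for L in range(L_max + 1)]
    h = {L: H[L] - H[L - 1] for L in range(1, L_max + 1)}    # h_mu(L)
    E_sum = sum(h[L] - h_mu for L in h)                       # Eq. (48)
    E_weighted = -sum((L - 1) * (h[L] - h[L - 1])
                      for L in range(2, L_max + 1))           # Eq. (49)
    E_intercept = H[L_max] - h_mu * L_max                     # Eq. (51)
    return E_sum, E_weighted, E_intercept

if __name__ == "__main__":
    print(excess_entropy_forms("1010111011101110"))
```

All three values agree and equal $\log_2 16 = 4$ bits for this template.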



3. E as mutual information 

Yet another way to understand excess entropy is through its
expression as a mutual information.

Proposition 8 The excess entropy is the mutual information between
the left and right (past and future) semi-infinite halves of the
chain $\overleftrightarrow{S}$:

$$\mathbf{E} = \lim_{L \to \infty} I[S_0 S_1 \cdots S_{2L-1} ; S_{2L} S_{2L+1} \cdots S_{4L-1}] , \tag{53}$$

when the limit exists.

Proof: See App. A7. □

Note that $\mathbf{E}$ is not a two-symbol mutual information, but is
instead the mutual information between two semi-infinite blocks of
variables.

Eq. (53) says that $\mathbf{E}$ measures the amount of historical
information stored in the present that is communicated to the future.
For a discussion of some of the subtleties associated with this
interpretation, however, see Ref. [^ .

Prop. 7 also shows that $\mathbf{E}$ can be interpreted as the cost
of amnesia: If one suddenly loses track of a source, so that it
cannot be predicted at an error level determined by the entropy rate
$h_\mu$, then the entire string appears more random by a total of
$\mathbf{E}$ bits.



4. Finitary processes

We have argued above that the excess entropy $\mathbf{E}$ provides a
measure of one kind of memory. Thus, we refer to those processes with
a finite excess entropy as finite-memory sources or, simply, finitary
processes, and those with infinite memory, infinitary processes.

Definition 1 A process is finitary if its excess entropy is finite.

Definition 2 A process is infinitary if its excess entropy is
infinite.

Proposition 9 For finitary processes the entropy-rate estimate
$h_\mu(L)$ decays faster than $1/L$ to the entropy rate $h_\mu$. That
is,

$$h_\mu(L) - h_\mu < \frac{A}{L} , \tag{54}$$

for large $L$ and where $A$ is a constant. For infinitary processes
$h_\mu(L)$ decays at or slower than $1/L$.

Proof. By direct inspection of Eq. (48). □

One consequence is that the entropy growth for finitary processes
scales as $H(L) \sim \mathbf{E} + h_\mu L$ in the $L \to \infty$
limit, where $\mathbf{E}$ is a constant, independent of $L$. In
contrast, an infinitary process might scale as

$$H(L) \sim c_1 + c_2 \log L , \tag{55}$$

where $c_1$ and $c_2$ are constants. For such a system, the excess
entropy $\mathbf{E}$ diverges logarithmically and
$h_\mu(L) - h_\mu \sim L^{-1}$.

In Sec. VI we shall determine $\mathbf{E}$, $h_\mu$, and related
quantities for several finitary sources and one infinitary source.
There are, however, a few particularly simple classes of finitary
process for which one can obtain general expressions for
$\mathbf{E}$, which we state here before continuing.

Proposition 10 For a periodic process of period $p$, the excess
entropy is given by

$$\mathbf{E} = \log_2 p . \tag{56}$$

Proof: One observes that $H(L) = \log_2 p$, for $L \geq p$. □

Proposition 11 For an order-$R$ Markovian process, the excess entropy
is given by

$$\mathbf{E} = H(R) - R h_\mu . \tag{57}$$

Recall that an order-$R$ Markovian process was defined in Eq. §).

Proof: This result will be proved in Sec. VI C, when we consider an
example Markovian process. Also, see Refs. lo), m, and @. □

For finitary processes that are not finite-order Markovian, the
entropy-rate estimate $h_\mu(L)$ often decays exponentially to the
entropy rate $h_\mu$:

$$h_\mu(L) - h_\mu \sim A 2^{-\gamma L} , \tag{58}$$

for large $L$ and where $\gamma$ and $A$ are constants.

Exponential decay was first observed for various kinds of
one-dimensional map of the interval and a scaling theory was
developed based on that ansatz [^ . Later Eq. (58) was proven to hold
for one-dimensional, fully chaotic maps with a unique invariant
ergodic measure that is absolutely continuous with respect to the
Lebesgue measure Q. To our knowledge, there is not a direct proof of
exponential decay for more general finitary processes. There is,
however, a large amount of empirical evidence suggesting this form of
convergence Jl3| , p2 25 , 2^]. Nevertheless, several lines of
reasoning suggest that exponential decay is typical and to be
expected. For further discussion, see Appendix

Corollary 1 For exponential-decay finitary processes the excess
entropy is given by

$$\mathbf{E} \approx \mathbf{E}_\gamma \equiv \frac{H(1) - h_\mu}{1 - 2^{-\gamma}} , \tag{59}$$

where $\gamma$ is the decay exponent of Eq. (58) and $H(1)$ is the
single-symbol entropy.

Proof: One directly calculates the area between two curves in the
entropy convergence plot of Fig. 3. The first is the constant line at
$h_\mu$. The second is the curve specified by Eq. (58) with the
boundary condition $h_\mu(1) = H(1)$. Alternatively, Eq. (58) may be
inserted into Eq. (49); Eq. (59) then follows after a few steps of
algebra. □

Note that Eq. (59) is an approximate result; it is exact only if
Eq. (58) holds for all $L$. In practice, for small $L$,
$h_\mu(L) - h_\mu$ is larger than its asymptotic form
$A 2^{-\gamma L}$ and, thus, $\mathbf{E}_\gamma$ gives an upper bound
on $\mathbf{E}$.
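The closed form of Eq. (59) can be checked by summing the exponential form of Eq. (58) directly. The sketch below assumes that Eq. (58) holds exactly for every $L \geq 1$ with the boundary condition $h_\mu(1) = H(1)$; the numerical values of $H(1)$, $h_\mu$, and $\gamma$ are arbitrary illustrative inputs, not data from any process discussed here.

```python
def excess_entropy_exponential(H1: float, h_mu: float, gamma: float) -> float:
    """Closed form E_gamma = (H(1) - h_mu) / (1 - 2**(-gamma)), Eq. (59)."""
    return (H1 - h_mu) / (1.0 - 2.0 ** (-gamma))

def excess_entropy_partial_sum(H1: float, h_mu: float, gamma: float,
                               L_max: int = 200) -> float:
    """Direct sum of h_mu(L) - h_mu = A * 2**(-gamma * L) over L = 1..L_max,
    with A fixed by the boundary condition h_mu(1) = H(1)."""
    A = (H1 - h_mu) * 2.0 ** gamma
    return sum(A * 2.0 ** (-gamma * L) for L in range(1, L_max + 1))

if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the formulas.
    print(excess_entropy_exponential(0.9, 0.5, 0.7))
    print(excess_entropy_partial_sum(0.9, 0.5, 0.7))
```

The two numbers agree to within floating-point error, since the sum is a geometric series in $2^{-\gamma}$.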
5. Finite-L expressions for E

There are at least two different ways to estimate the excess entropy
$\mathbf{E}$ for finite $L$. First, we have the partial-sum estimate
given by

$$\mathbf{E}(L) \equiv H(L) - L h_\mu(L) = \sum_{M=1}^{L} \left[ h_\mu(M) - h_\mu(L) \right] . \tag{60}$$

The second equality follows immediately from the integration formula,
Eq. (pl|), and the boundary condition $H(0) = 0$.

Alternatively, a finite-$L$ excess entropy can be defined as the
mutual information between $L/2$-blocks:

$$\mathbf{E}'(L) \equiv I[S_0 S_1 \cdots S_{L/2-1} ; S_{L/2} S_{L/2+1} \cdots S_{L-1}] , \tag{61}$$

for $L$ even. If $L$ is odd, we define
$\mathbf{E}'(L) = \mathbf{E}'(L-1)$. The expression in Eq. (61),
however, is not as good an estimator of $\mathbf{E}$ as that of
Eq. (60), as established by the following lemma:

Lemma 3

$$\mathbf{E}'(L) \leq \mathbf{E}(L) \leq \mathbf{E} . \tag{62}$$

Proof: See App. AO. □
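The ordering in Lemma 3 can be explored numerically on a small Markov chain. The sketch below uses a golden-mean-style chain as an assumed example (a 0 is always followed by a 1; after a 1, the symbols 0 and 1 are equally likely), computes $H(L)$ exactly by enumerating words, and evaluates both finite-$L$ estimators. The transition table, the stationary distribution, and all function names are assumptions of this sketch; for a stationary source the mutual information in Eq. (61) reduces to $2H(L/2) - H(L)$.

```python
from itertools import product
from math import log2

# Assumed order-1 chain: "00" is forbidden, and after a 1 both symbols are equally likely.
TRANS = {"0": {"1": 1.0}, "1": {"0": 0.5, "1": 0.5}}
PI = {"0": 1.0 / 3.0, "1": 2.0 / 3.0}   # stationary distribution of this chain

def word_probability(word: str) -> float:
    if not word:
        return 1.0
    pr = PI[word[0]]
    for a, b in zip(word, word[1:]):
        pr *= TRANS[a].get(b, 0.0)
    return pr

def H(L: int) -> float:
    """Exact block entropy in bits, by enumerating all 2**L binary words."""
    probs = (word_probability("".join(w)) for w in product("01", repeat=L))
    return sum(-q * log2(q) for q in probs if q > 0.0)

# Entropy rate of the chain: stationary average of the per-state transition entropies.
H_MU = sum(PI[a] * sum(-q * log2(q) for q in TRANS[a].values()) for a in PI)
E_EXACT = H(1) - 1 * H_MU               # Eq. (57) with R = 1

def E_partial(L: int) -> float:
    """Partial-sum estimate E(L) = H(L) - L * h_mu(L), Eq. (60)."""
    return H(L) - L * (H(L) - H(L - 1))

def E_mutual(L: int) -> float:
    """Block mutual-information estimate E'(L), Eq. (61); odd L uses L - 1."""
    L -= L % 2
    return 2.0 * H(L // 2) - H(L)

if __name__ == "__main__":
    for L in range(1, 7):
        print(L, round(E_mutual(L), 4), round(E_partial(L), 4), round(E_EXACT, 4))
```

For this order-1 chain both estimators reach the exact value already at $L = 2$, so the inequalities of Eq. (62) hold with equality from there on.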



V. TRANSIENT INFORMATION 

Thus far, we have discussed derivatives of the entropy growth curve
$H(L)$, and we have also defined and interpreted two integrals: the
total predictability $\mathbf{G}$ and the excess entropy
$\mathbf{E}$. Both $\mathbf{G}$ and $\mathbf{E}$ have been introduced
previously by a number of authors.

In this section, however, we introduce a new quantity, by following
the same line of reasoning that led us to the total predictability,
$\mathbf{G} = \mathcal{I}_2$, and the excess entropy,
$\mathbf{E} = \mathcal{I}_1$. That is, we ask: How does $H(L)$
converge to its asymptote $\mathbf{E} + h_\mu L$? The answer to this
question is provided by $\mathcal{I}_0$. For reasons that will become
clear below, we shall call $-\mathcal{I}_0$ the transient information
$\mathbf{T}$:

$$\mathbf{T} \equiv -\mathcal{I}_0 = \sum_{L=0}^{\infty} \left[ \mathbf{E} + h_\mu L - H(L) \right] . \tag{63}$$

Note that the units of $\mathbf{T}$ are bits $\times$ symbols.

The following result establishes an interpretation of $\mathbf{T}$.

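Proposition 12 The transient information may be written as:

$$\mathbf{T} = \sum_{L=1}^{\infty} L \left[ h_\mu(L) - h_\mu \right] . \tag{64}$$

Proof: The proof is a straightforward calculation; however, since it
is a new result, we include it here. We begin by writing the
right-hand side of Eq. (64) as a partial sum:

$$\sum_{L=1}^{\infty} L \left[ h_\mu(L) - h_\mu \right] = \lim_{M \to \infty} \sum_{L=1}^{M} \left[ L \Delta H(L) - h_\mu L \right] . \tag{65}$$

Using Eq. (^), this becomes:

$$\sum_{L=1}^{\infty} L \left[ h_\mu(L) - h_\mu \right] = \lim_{M \to \infty} \left\{ M H(M) - \sum_{L=0}^{M-1} H(L) - \sum_{L=0}^{M} h_\mu L \right\} . \tag{66}$$

Using $M H(M) = \sum_{L=0}^{M-1} H(M)$ and
$\lim_{M \to \infty} H(M) = \mathbf{E} + h_\mu M$, and rearranging
slightly, we have:

$$\sum_{L=1}^{\infty} L \left[ h_\mu(L) - h_\mu \right] = \lim_{M \to \infty} \left\{ \sum_{L=0}^{M-1} \left[ \mathbf{E} - H(L) \right] + \sum_{L=0}^{M-1} h_\mu M - \sum_{L=0}^{M} h_\mu L \right\} . \tag{67}$$

But,

$$\sum_{L=0}^{M-1} h_\mu M - \sum_{L=0}^{M} h_\mu L = h_\mu \left[ M^2 - \tfrac{1}{2} M(M+1) \right] \tag{68}$$
$$= h_\mu \tfrac{1}{2} M(M-1) \tag{69}$$
$$= \sum_{L=0}^{M-1} h_\mu L . \tag{70}$$

Using this last line in Eq. (67), we have

$$\sum_{L=1}^{\infty} L \left[ h_\mu(L) - h_\mu \right] = \lim_{M \to \infty} \left\{ \sum_{L=0}^{M-1} \left[ \mathbf{E} + h_\mu L - H(L) \right] \right\} . \tag{71}$$

The right-hand side of the above equation is $\mathbf{T}$, completing
the proof. □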

Recall that $\mathbf{E} + h_\mu L$ is the entropy growth curve for a
finitary process, as discussed in Sec. IV B 4. Thus, $\mathbf{T}$ may
be viewed as a sum of redundancies,
$(\mathbf{E} + h_\mu L) - H(L)$, between the source's actual entropy
growth $H(L)$ and the $\mathbf{E} + h_\mu L$ finitary-process
approximation.
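The equivalence of Eqs. (63) and (64) can be checked on a small example. The sketch below evaluates both sums for a periodic binary source, where $h_\mu = 0$ and the sums terminate once $H(L)$ has saturated. The function names, the truncation at twice the period, and the choice of the period-16 template are assumptions of the sketch.

```python
from collections import Counter
from math import log2

def block_entropy(template: str, L: int) -> float:
    """H(L) in bits for the periodic process repeating `template`."""
    p = len(template)
    words = Counter(
        "".join(template[(i + k) % p] for k in range(L)) for i in range(p)
    )
    return sum((c / p) * log2(p / c) for c in words.values())

def transient_information(template: str, h_mu: float = 0.0):
    """Return T from Eq. (63) and from Eq. (64), truncated at two periods."""
    L_max = 2 * len(template)
    H = [block_entropy(template, L) for L in range(L_max + 1)]
    E = H[L_max] - h_mu * L_max                        # intercept, Eq. (51)
    T_gap = sum(E + h_mu * L - H[L] for L in range(L_max + 1))   # Eq. (63)
    T_weighted = sum(L * (H[L] - H[L - 1] - h_mu)
                     for L in range(1, L_max + 1))               # Eq. (64)
    return T_gap, T_weighted

if __name__ == "__main__":
    print(transient_information("1010111011101110"))
```

Both expressions give the same value, about 16.61 bit-symbols for this template.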
A. T and Synchronization Information

For finitary processes $H(L)$ scales as $\mathbf{E} + h_\mu L$ for
large $L$. When this scaling form is reached, we say that the
observer is synchronized to the process. In other words, when

$$\mathbf{T}(L) \equiv \mathbf{E} + h_\mu L - H(L) = 0 , \tag{72}$$

we say the observer is synchronized at length-$L$ sequences. As we
will see below, observer-process synchronization corresponds to the
observer being in a condition of knowledge such that it can predict
the process outputs at an error rate determined by the process's
entropy rate $h_\mu$.

On average, how much information must an observer extract from
measurements so that it is synchronized to the process in the sense
described above? As argued in the previous section, an answer to this
question is given by the transient information $\mathbf{T}$.

We now establish a direct relationship between the transient
information $\mathbf{T}$ and the amount of information required for
observer synchronization to block-Markovian processes. We begin by
stating the question of observer synchronization information
theoretically and fixing some notation.

Assume that the observer has a correct model
$\mathcal{M} = \{\mathcal{V}, T\}$ of a process, where $\mathcal{V}$
is a set of states and $T$ the rule governing transitions between
states. That is, $T$ is a matrix whose components $T_{AB}$ give the
probability of making a transition to state $B$, given that the
system is in state $A$, where $A, B \in \mathcal{V}$. Contrary to the
scenario shown in Fig. 1, in this section we assume that the observer
directly measures the process's states. That is, we have a Markov
process, rather than a hidden Markov process.

The task for the observer is to make observations and determine in
which state $v \in \mathcal{V}$ the process is. Once the observer
knows with certainty in which state the process is, the observer is
synchronized to the source and the average per-symbol uncertainty is
exactly $h_\mu$. We are interested in describing how difficult it is
to synchronize to a directly observed Markov process.

The observer's knowledge of $\mathcal{V}$ is given by a distribution
over the states $v \in \mathcal{V}$. Let
$\Pr(v|s^L, \mathcal{M})$ denote the distribution over $\mathcal{V}$
given that the particular sequence of symbols $s^L$ has been
observed. The entropy of this distribution over the states measures
the observer's average uncertainty in $v \in \mathcal{V}$:

$$H[\Pr(v|s^L, \mathcal{M})] \equiv -\sum_{v \in \mathcal{V}} \Pr(v|s^L, \mathcal{M}) \log_2 \Pr(v|s^L, \mathcal{M}) . \tag{73}$$

Averaging this uncertainty over the possible length-$L$ observations,
we obtain the average state-uncertainty:

$$\mathcal{H}(L) \equiv -\sum_{s^L} \Pr(s^L) \sum_{v \in \mathcal{V}} \Pr(v|s^L, \mathcal{M}) \log_2 \Pr(v|s^L, \mathcal{M}) . \tag{74}$$

The quantity $\mathcal{H}(L)$ can be used as a criterion for
synchronization. The observer is synchronized to the source when
$\mathcal{H}(L) = 0$, that is, when the observer is completely
certain about which state $v \in \mathcal{V}$ the mechanism
generating the sequence is in. And thus, when the condition in
Eq. (72) is met, we see that $\mathcal{H}(L) = 0$, and the
uncertainty associated with the prediction of the model
$\mathcal{M}$ is exactly $h_\mu$.

While the observer is still unsynchronized, though,
$\mathcal{H}(L) > 0$. We refer to the average total uncertainty
experienced by an observer during the synchronization process as the
synchronization information $\mathbf{S}$:

$$\mathbf{S} \equiv \sum_{L=0}^{\infty} \mathcal{H}(L) . \tag{75}$$

The synchronization information measures the average total
information that must be extracted from measurements so that the
observer is synchronized.
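For a periodic source the average state-uncertainty of Eq. (74) and the synchronization information of Eq. (75) can be computed by brute force. The sketch below identifies each state with a phase of the repeated template, which is an assumption of this sketch, and takes all phases to be equally likely before any symbol is observed; the function names are likewise hypothetical.

```python
from collections import defaultdict
from math import log2

def state_uncertainty(template: str, L: int) -> float:
    """Average state-uncertainty after observing L symbols, Eq. (74),
    for the periodic process repeating `template`."""
    p = len(template)
    compatible = defaultdict(int)   # observed word -> number of phases producing it
    for start in range(p):
        word = "".join(template[(start + k) % p] for k in range(L))
        compatible[word] += 1
    # Pr(word) = n / p, and the posterior over the n compatible phases is
    # uniform, so its entropy is log2(n).
    return sum((n / p) * log2(n) for n in compatible.values())

def synchronization_information(template: str) -> float:
    """S as the sum of the state-uncertainties over L = 0, 1, 2, ..., Eq. (75).
    The sum is truncated at two periods, after which every term is zero."""
    return sum(state_uncertainty(template, L) for L in range(2 * len(template) + 1))

if __name__ == "__main__":
    for template in ("11000", "10000", "10101"):
        print(template, round(synchronization_information(template), 3))
```

The state-uncertainty starts at $\log_2 p$ for $L = 0$ and reaches zero once every observed word pins down a single phase.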

In the following, we assume that our model is Markovian of order
$R$. Additionally, we assume that the set of Markovian states
$\mathcal{V}$ is associated with the $|\mathcal{A}|^R$ possible
values of $R$ consecutive symbols; henceforth the latter are referred
to as $R$-blocks. Specifically, there is a one-to-one correspondence
between the states $v$ and the $R$-blocks, and hence there exists a
one-to-one, invertible function $\phi : s^R \to \mathcal{V}$. This
function $\phi$ enables us to move back and forth between the states
$v$ and the $R$-blocks. For example, we may use $\phi$ to rewrite the
set of states:

$$\mathcal{V} = \{ \phi(s_1 s_2 \cdots s_R) : s_i \in \mathcal{A}, 1 \leq i \leq R \} . \tag{76}$$

The matrix $T$ gives the transition probabilities between symbol
blocks. Note that the Markovian states are "sliding" in the sense
that a transition from one state to another corresponds to a
transition from, say, symbol block $s_0 s_1 \cdots s_{R-1}$ to
$s_1 s_2 \cdots s_R$. Thus, it is not hard to see that the transition
matrix $T$ is sparse; there are at most $|\mathcal{A}|^{R+1}$ nonzero
entries in the $|\mathcal{A}|^R \times |\mathcal{A}|^R$ matrix $T$.

Theorem 1 For a block-Markovian process, the synchronization
information $\mathbf{S}$ is given by:

$$\mathbf{S} = \mathbf{T} + \tfrac{1}{2} R(R+1) h_\mu . \tag{77}$$

Proof: See App. ^. □

Thus, the transient information $\mathbf{T}$, together with the
entropy rate $h_\mu$ and the order $R$ of the Markov process,
measures how difficult it is to synchronize to a process. If a system
has a large $\mathbf{T}$, then, on average, an observer will be
highly uncertain about the internal state of the process while
synchronizing to it. The transient information measures a structural
property of the system, one not captured by the excess entropy
$\mathbf{E}$.
Corollary 2 For periodic processes, $\mathbf{S} = \mathbf{T}$.

Proof: For periodic processes, $h_\mu = 0$. Plugging this into
Eq. (77), the corollary follows. □

In Sec. VI B 2 we shall see that, while the excess entropy is the
same ($\log_2 p$) for all period-$p$ processes, the transient
informations are different. Thus, the transient information allows
one to draw structural distinctions between different periodic
sequences.

Corollary 3 For exponential-decay finitary processes we have

$$\mathbf{T}_\gamma \equiv \frac{H(1) - h_\mu}{(1 - 2^{-\gamma})^2} . \tag{78}$$

Proof: Inserting Eq. (58) into the expression for $\mathbf{T}$ given
in Eq. (64), the result follows after several steps. □

Combining Eqs. (59) and (78), we arrive at an exact relationship
between the approximate expressions for the excess entropy and the
transient information:

$$\mathbf{T}_\gamma = \frac{\mathbf{E}_\gamma^2}{H(1) - h_\mu} . \tag{79}$$
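The steps leading from the exponential form of Eq. (58) to Eqs. (59), (78), and (79) can be sketched as follows, assuming that Eq. (58) holds exactly for all $L \geq 1$ with the boundary condition $h_\mu(1) = H(1)$. The shorthand $x = 2^{-\gamma}$ is introduced here only for this sketch.

```latex
\begin{align*}
h_\mu(L) - h_\mu &= A x^{L} , \qquad A x = H(1) - h_\mu , \\
\mathbf{E}_\gamma &= \sum_{L=1}^{\infty} A x^{L}
  = \frac{A x}{1 - x}
  = \frac{H(1) - h_\mu}{1 - 2^{-\gamma}} , \\
\mathbf{T}_\gamma &= \sum_{L=1}^{\infty} L \, A x^{L}
  = \frac{A x}{(1 - x)^{2}}
  = \frac{H(1) - h_\mu}{(1 - 2^{-\gamma})^{2}} , \\
\frac{\mathbf{E}_\gamma^{2}}{\mathbf{T}_\gamma} &= H(1) - h_\mu .
\end{align*}
```

The second and third lines use the geometric series and its derivative with respect to $x$; both converge because $0 < x < 1$ for $\gamma > 0$.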



B. Summary 

This completes our exposition of entropy convergence and our method
of differentiating and integrating $H(L)$ to move between levels.
Table I summarizes the first levels of the entropy convergence
hierarchy as investigated in the preceding sections.



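Entropy-Convergence Hierarchy

Level | Derivative | $L_n$ | $L \to \infty$ Limit | Integral: At Level n | Integral: From Level n
0 | $H(L)$ | $L_0 = 0$ | $\infty$ or $\log_2 p$ | $\mathbf{T} = \sum_{L=0}^{\infty} [\mathbf{E} + h_\mu L - H(L)]$ | $\mathbf{T} = \sum_{L=1}^{\infty} L [\Delta H(L) - h_\mu]$
1 | $\Delta H(L)$ | $L_1 = 1$ | $h_\mu$ | $\mathbf{E} = \sum_{L=1}^{\infty} [\Delta H(L) - h_\mu]$ | $\mathbf{E} = -\sum_{L=2}^{\infty} (L-1) \Delta^2 H(L)$
2 | $\Delta^2 H(L)$ | $L_2 = 1$ | $0$ | $\mathbf{G} = \sum_{L=1}^{\infty} \Delta^2 H(L)$ | $\mathbf{G} = -\sum_{L=2}^{\infty} (L-1) \Delta^3 H(L)$
n | $\Delta^n H(L)$ | $L_n = n - 1$ | $0$ | $\mathcal{I}_n = \sum_{L=L_n}^{\infty} [\Delta^n H(L) - \lim_{L \to \infty} \Delta^n H(L)]$ | $-\sum_{L=L_n}^{\infty} (L-1) \Delta^{n+1} H(L)$

TABLE I. Moving up and down the first levels of entropy convergence.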



VI. EXAMPLES 

This section analyzes several variously structured processes to
illustrate a range of different entropy convergence behaviors. The
results demonstrate what the preceding quantities (such as the
entropy rate, the excess entropy, and the transient information) do
and do not indicate about a process's organization.



A. Independent, Identically Distributed Processes 

We begin with the simplest stochastic process: binary variables
independently and identically distributed (IID), as in Eq. (^.
Figure 5 shows the entropy growth curve $H(L)$ for two IID processes:
a fair coin and a biased coin with a bias of 0.7.

For both coins $H(L)$ grows linearly. Hence, $\Delta H(L)$ is
constant for these and all other IID processes. Note, however, that
the two systems have different entropy rates $h_\mu$. The fair coin
has an $h_\mu$ of 1 bit per symbol, while the biased coin, being less
unpredictable, has $h_\mu \approx 0.8813$. As a result, from Eq. (46)
the total predictability
$\mathbf{G} = \log_2 |\mathcal{A}| - h_\mu = 0$ bits for the fair
coin and 0.1187 bits for the biased coin. The predictability of each
process is rather low, as expected.




FIG. 5. Entropy growth for IID processes: a fair coin (solid 
line) and a coin (dashed line) with bias p = 0.7. 

As is clear from Fig. 5, for both processes the excess entropy
$\mathbf{E}$ and the transient information $\mathbf{T}$ are zero.
This makes sense in light of the interpretations of $\mathbf{E}$ and
$\mathbf{T}$ given in the previous sections. Each coin flip does not
depend on past flips, and so there is no mutual information between
the past and the future. Thus, $\mathbf{E} = 0$. Similarly, no
information is needed to synchronize to the source, since $H(L)$
assumes its asymptotic form at $L = 1$, and so $\mathbf{T} = 0$. That
is, the statistics of isolated flips are all that is required to
optimally predict both processes. Historical information does not
improve predictability.
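The numbers quoted for the two coins follow from the binary entropy function. The short sketch below reproduces them; the function names are hypothetical, and the linear block entropy $H(L) = L\,h_\mu$ is the IID property stated above.

```python
from math import log2

def binary_entropy(p: float) -> float:
    """Entropy in bits of a coin that shows one face with probability p (0 < p < 1)."""
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)

def iid_block_entropy(p: float, L: int) -> float:
    """For an IID source the block entropy grows linearly: H(L) = L * h_mu."""
    return L * binary_entropy(p)

if __name__ == "__main__":
    for bias in (0.5, 0.7):
        h_mu = binary_entropy(bias)
        predictability = log2(2) - h_mu                      # log2|A| - h_mu
        intercept = iid_block_entropy(bias, 10) - 10 * h_mu  # H(L) - h_mu * L
        print(bias, round(h_mu, 4), round(predictability, 4), intercept)
```

The intercept $H(L) - h_\mu L$ is zero at every $L$, which is why both the excess entropy and the transient information vanish for IID sources.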




FIG. 6. Entropy curves for the period-16 process:
$\cdots(1010111011101110)^\infty\cdots$. (a) Entropy growth (solid
line) and $\mathbf{E} + h_\mu L$ (dashed line). (b) Entropy
convergence for the two estimators $h_\mu(L)$ (solid line) and
$h'_\mu(L)$ (dotted line).

B. Periodic Processes 

1. A period-16 process 

We now consider periodic processes. We begin with a period-16
process, whose $H(L)$ is shown in Fig. 6(a). The sequence consists of
repetitions of the length-16 block $s^{16} = 1010111011101110$. In
Fig. 6(b) we show the convergence of entropy density estimates to the
asymptotic value, $h_\mu = 0$. As for all periodic processes, the
entropy rate $h_\mu$ for the period-16 process is zero; at
sufficiently large $L$ the process is perfectly predictable. In
addition to $h_\mu(L)$, defined above as $H(L) - H(L-1)$, we show
$h'_\mu(L) = H(L)/L$. The total entropy converges at $L = 12$. The
value of $H(12) = 4$ bits reflects the fact that there are 16 equally
probable sequences at each $L \geq 12$.







FIG. 7. (a) Entropy growth for all period-5 processes, along with the
asymptote $\mathbf{E} + h_\mu L = \log_2 5 \approx 2.321$ (thin
dashed line). (b) Entropy convergence for the same period-5
processes. (c) Predictability gain $\Delta^2 H(L)$. Legend:
11000, $\mathbf{T} = 4.073$; 10101, $\mathbf{T} = 4.873$;
10000, $\mathbf{T} = 5.273$.



Template Word | Number of Observations to Synchronize [symbols] | T [bit-symbols]
11000 | 2.4 | 4.073
10000 | 2.8 | 5.273
10101 | 3.2 | 4.873

TABLE II. Synchronizing to period-5 processes: Comparing the
transient information $\mathbf{T}$ to the average number of
observations required to synchronize to the three distinct period-5
sequences.
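Both columns of Table II can be recomputed by brute force. In the sketch below, an observer is taken to be synchronized at the first block length for which the observed word occurs at only one phase of the template; this reading of "number of observations to synchronize", the function names, and the truncation of the sum for $\mathbf{T}$ are assumptions of the sketch.

```python
from collections import Counter
from math import log2

def window(template: str, start: int, L: int) -> str:
    """Length-L word read cyclically from `template`, beginning at `start`."""
    p = len(template)
    return "".join(template[(start + k) % p] for k in range(L))

def mean_sync_length(template: str) -> float:
    """Average, over equally likely starting phases, of the first L at which
    the observed word identifies the phase uniquely."""
    p = len(template)
    total = 0
    for start in range(p):
        L = 1
        while sum(window(template, j, L) == window(template, start, L)
                  for j in range(p)) > 1:
            L += 1
            if L > p:
                raise ValueError("template repeats with a period shorter than its length")
        total += L
    return total / p

def transient_information(template: str) -> float:
    """T = sum over L >= 0 of [log2(p) - H(L)], i.e. Eq. (63) with h_mu = 0."""
    p = len(template)
    T = 0.0
    for L in range(p + 1):
        counts = Counter(window(template, start, L) for start in range(p))
        H_L = sum((c / p) * log2(p / c) for c in counts.values())
        T += log2(p) - H_L
    return T

if __name__ == "__main__":
    for template in ("11000", "10000", "10101"):
        print(template, mean_sync_length(template),
              round(transient_information(template), 3))
```

Under these assumptions the sketch reproduces the tabulated values, including the fact that the template with the largest $\mathbf{T}$ is not the one requiring the most observations.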
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The excess entropy E for the period-16 process is 
log2l6 = 4 bits; the sequence's past carries 4 bits of 
phase information about the future. Geometrically, E 
is the vertical-intercept of the horizontal asymptote on 
Fig. @(a) (dashed line) or the area under h^^L) on 
Fig. ^(b). The predictability is G = log22 = 1 bit per 
symbol; the system can be predicted perfectly. Finally, 
the transient information for this period-16 process is 
T Ri 16.6135 bit-symbols. Since this process is Marko- 
vian, Thm. |l| applies. Thus, we conclude that, on aver- 
age, an observer would measure a total uncertainty S of 
16.6135 bits during the process of synchronization. 



2. T distinguishes period-p processes

For any periodic process of period p, $h_\mu = 0$ and
E = $\log_2 p$. However, there are important structural
differences between different sequences with the same
period. To show this, we consider all binary period-5
processes that are distinct up to permutations and
(0 ↔ 1) exchanges in their "template" words. There
are only three such processes: $(11000)^\infty$, $(10101)^\infty$,
and $(10000)^\infty$. By the symmetries of the Shannon entropy
function these processes illustrate the only three types
of entropy convergence behavior possible for period-5
sequences.

Panel (a) of the accompanying figure shows the entropy growth
curves for each; panel (b) gives the entropy convergence
curves; and panel (c) gives the predictability gain
$\Delta^2 H(L)$. By L = 4, H(L) converges to
E = $\log_2 5 \approx 2.322$ bits. We see that $h_\mu(L) = 0$
at this and larger L. For all three processes,
G = 1 − 0 = 1 bit per symbol: Again, the information
in each measurement concerns the periodic component
of the process. The predictability gain per measurement
vanishes at L = 6, since at that point all length-5
templates have been completely parsed and the process
appears completely predictable. It is a useful exercise in
understanding $\Delta^2 H(L)$ to work through each template
symbol-by-symbol to see which symbols are more and
less informative about each template's phase. For example,
on the one hand, observing the fourth symbol
of the $(10101)^\infty$ process does not improve predictability.
On the other hand, the third symbol for the $(11000)^\infty$
process is highly informative and predictability increases
markedly.

The earlier corollary applies here and, since $h_\mu = 0$, says
that the synchronization information S is equal to T; and so,
we can directly interpret T as the synchronization
information. Table II gives the values of the transient
information T, which are all different, indicating that an
observer comes to synchronize to the distinct templates
differently. Table II also gives the average number
of observations required to synchronize. From this table,
we see that T is not directly proportional to the number
of measurements to synchronize. Rather, it is the
total amount of information that must be extracted to
synchronize.

In summary, this example shows that there are structural
differences between different periodic processes of
the same period. The transient information is able to
capture these differences, while the excess entropy is
unable to. Since many chaotic systems, for example, are
a combination of periodicity and randomness, one sees
that the transient information is useful in detecting
synchronization to the ordered component of such processes.



C. Markovian Processes




We now consider a simple Markovian process with a
nonzero entropy rate $h_\mu$. (The periodic systems of the
previous section are Markovian, but with $h_\mu = 0$.) In
particular, we shall consider the golden mean (GM)
process, a Markov chain of order one.

In terms of the sequences produced, the underlying
golden mean system produces all binary strings with no
consecutive 0s. The probabilistic version, the golden
mean process, generates 0s and 1s with equal probability,
except that once a 0 is generated, a 1 always follows. One
can write down a simple two-state Markov chain for this
process. The GM process is so named because the
logarithm of the total number of allowed sequences grows
with L at a rate given by the logarithm of the golden
mean, $\phi = \frac{1}{2}(1 + \sqrt{5})$.

The various entropy convergence curves for the GM
process are shown in Fig. 8. The entropy rate of the
GM process is $h_\mu = 2/3$ bits per symbol and the
predictability is G = 1/3 bit per symbol. The convergence
of $h_\mu(L)$ to $h_\mu$ occurs at sequence length L = 2. In
other words, once the statistics over all possible length-2
sequences are known, one gains no additional predictability
by keeping track of the occurrence of blocks of larger
length. There is, however, a large predictability gain in
going from blocks of length 1 to blocks of length 2.
Observing that 00 is missing is the key observation that
makes this system predictable. The predictability gain
per symbol $\Delta^2 H(L)$ is shown in Fig. 8(c). Note that the
second measurement is more informative than the first.

We find that E ≈ 0.2516 bits, and T = E, which can
be easily deduced from the H(L) versus L graph in
Fig. 8(a). From these small values for E and T one concludes
that not much historical information is needed to perform
optimal prediction, nor is there much uncertainty
associated with synchronization.

For this system we find that H(1) ≈ 0.9183 bits. Plugging
this and our result for $h_\mu$ into Eq. (57), we see that
the expression for the excess entropy of a Markovian
process is verified.

The behavior shown in Fig. 8 is typical for Markovian
processes. For an order-R Markovian process, the
entropy density estimates $h_\mu(L)$ will always converge
exactly to $h_\mu$ by L = R + 1, since conditioning on more
than the R preceding symbols does not reduce the uncertainty
in the next symbol. Given this, we know that
H(R) = E + $h_\mu R$. Solving this for E, we arrive at the
Markovian expression for the excess entropy,
E = H(R) − R $h_\mu$.

FIG. 8. (a) Entropy growth (solid line) for the golden mean
process, along with the asymptote E + $h_\mu L$ (dashed line).
(b) Entropy convergence, both $h_\mu(L)$ (solid line) and
$h'_\mu(L)$ (dashed line), for the same. (c) Predictability gain
$\Delta^2 H(L)$ versus sequence length.
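The golden mean values quoted in this section can be reproduced by brute-force enumeration of word probabilities. The sketch below is illustrative rather than the authors' method: it assumes a two-state presentation of the process (the matrices `T_SYM` and the stationary vector `PI` are written out by hand here) and the definition T = sum over L >= 0 of [E + h_mu L - H(L)].

```python
from itertools import product
from math import log2

import numpy as np

# Assumed two-state presentation of the golden mean process.
# State 0: the last symbol was 1 (either symbol may follow).
# State 1: the last symbol was 0 (a 1 must follow).
T_SYM = {
    "0": np.array([[0.0, 0.5], [0.0, 0.0]]),
    "1": np.array([[0.5, 0.0], [1.0, 0.0]]),
}
PI = np.array([2 / 3, 1 / 3])  # stationary distribution over the two states


def block_entropy(L: int) -> float:
    """H(L) in bits, summing -p log2 p over all 2**L candidate words."""
    H = 0.0
    for word in product("01", repeat=L):
        v = PI
        for s in word:
            v = v @ T_SYM[s]       # propagate the state distribution, emitting s
        p = float(v.sum())
        if p > 0:
            H -= p * log2(p)
    return H


H = [block_entropy(L) for L in range(9)]
h_mu = H[-1] - H[-2]               # exact here, since the process is order-1 Markov
E = H[1] - h_mu                    # Markovian expression with R = 1
T_transient = sum(E + h_mu * L - H[L] for L in range(len(H)))

print(round(H[1], 4), round(h_mu, 4), round(E, 4), round(T_transient, 4))
```

Only the L = 0 term contributes to the sum for T, which is why T and E coincide for this process.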



D. Hidden Markov Processes I: Complex Transient 
Structure 

For our next three examples, we consider three different
finitary hidden Markov processes. Each of these
examples contains some interesting surprises. We begin
by considering a process that consists of two successive
random symbols chosen to be 0 or 1 with equal probability
and a third symbol that is the logical Exclusive-OR
(XOR) of the two previous. We call this the
random-random-XOR (RRXOR) process. The entropy growth
and convergence plots are given in Figs. 9(a) and 9(b).



FIG. 9. (a) Entropy growth (solid line) for the
random-random-XOR process, along with the asymptote
E + $h_\mu L$ (dashed line). (b) Entropy convergence, both
$h_\mu(L)$ (solid line) and $h'_\mu(L)$ (dashed line), for the
same. (c) Predictability gain $\Delta^2 H(L)$ versus sequence
length.

FIG. 10. A least-squares fit (dashed line) to the exponential
decay of $h_\mu(L)$ (squares) for the RRXOR process.

The entropy rate $h_\mu$ is 2/3 bits per symbol. To see this,
note that two out of every three symbols are completely
random, while one third of the symbols are determined
by the previous two. Note further that the RRXOR
process has the same $h_\mu$, and hence the same G, as the GM
process of the previous section. This serves as yet another
reminder that the entropy rate is not sufficient to
distinguish the structural properties of a source.

At first blush, one might expect the entropy growth
curve to reach its asymptotic form at L = 3, just as
H(L) did at L = 2 for the golden mean process. However,
Fig. 9(a) shows that this is not the case. The reason that
it does not converge exactly at L = 3 is that the RRXOR
process is not Markovian; specifically, the observed
sequences of 0s and 1s are not finite-order Markovian.
The RRXOR is a hidden Markov process; its internal
states are Markovian, but the observed "states" are not.

Instead of converging exactly at finite L, the convergence
of $h_\mu(L)$ to $h_\mu$ is exponential:

    $h_\mu(L) - h_\mu \approx A\,2^{-\gamma L}$ ,    (80)

where we find A = 0.60 ± 0.02 and γ = 0.306 ± 0.004. This
fit is illustrated in Fig. 10.

The excess entropy is E = 2 bits: one needs to know
which of the four possible random symbol-pairs has
occurred before one reaches a condition of optimal
predictability. Thus, the process has $\log_2 4 = 2$ bits of
memory. However, the transient information is quite large;
T ≈ 9.43 bit-symbols. This indicates that the process is
difficult to synchronize to: Even after observing a large
number of symbols, there is still some uncertainty about
which internal, hidden state the process is in. Nevertheless,
the transient information is finite.

For this system, H(1) = 1. Using the earlier expressions
for E and T under purely exponential convergence, we find
estimates of approximately 1.74 bits and 9.12 bit-symbols,
respectively. The differences from the near-exact values
above indicate the amount of deviation from a pure
exponential decay of $h_\mu(L)$.

Intriguingly, the behavior of the predictability gain
$\Delta^2 H(L)$ of Fig. 9(c) shows strong hints of the structure
of the hidden Markov model that generates the observed
sequences. At lengths L = 1 and L = 2 symbols are not
informative at all: $\Delta^2 H(L) = 0$. This reflects the fact,
given by the process's definition, that two of the symbols
are produced by fair coin flips. For larger L, note that
$\Delta^2 H(L)$ shows oscillations of period three. The RRXOR
hidden Markov process also has a period-3 structure:
after the two random bits and the XOR bit, the hidden
Markov model always resets to the same state. Recall,
however, that $\Delta^2 H(L)$ is formed from statistics over the
observed symbols, not the hidden states of the process.
Given this, it is somewhat surprising that $\Delta^2 H(L)$ picks
up the period-3 nature of the transitions between hidden
states.
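The vanishing predictability gain at L = 1 and L = 2 can be seen directly by enumerating RRXOR words. The following is a minimal sketch under stated assumptions: the three phases of the random-random-XOR cycle are taken to be equally likely, and h_mu(0) is taken to be log2 2 = 1 so that $\Delta^2 H(1)$ is defined. The helper `rrxor_word_probs` is a name introduced here for illustration.

```python
from collections import defaultdict
from itertools import product
from math import log2


def rrxor_word_probs(L: int) -> dict:
    """Exact probabilities of length-L words, averaged over the three phases."""
    n_blocks = L // 3 + 2              # enough (r1, r2, xor) blocks for any phase
    n_random = 2 * n_blocks            # number of fair coin flips involved
    weight = 1.0 / (3 * 2 ** n_random)
    probs = defaultdict(float)
    for bits in product((0, 1), repeat=n_random):
        seq = []
        for k in range(n_blocks):
            a, b = bits[2 * k], bits[2 * k + 1]
            seq.extend((a, b, a ^ b))  # two random symbols, then their XOR
        for phase in range(3):
            probs[tuple(seq[phase:phase + L])] += weight
    return probs


def block_entropy(L: int) -> float:
    return -sum(p * log2(p) for p in rrxor_word_probs(L).values())


H = [block_entropy(L) for L in range(8)]
h = [1.0] + [H[L] - H[L - 1] for L in range(1, 8)]      # h_mu(L); h_mu(0) assumed 1
gain = [None] + [h[L] - h[L - 1] for L in range(1, 8)]  # Delta^2 H(L)

for L in range(1, 8):
    print(L, round(H[L], 4), round(gain[L], 4))
```

The first nonzero gain appears at L = 3, the first length at which a window can contain a full random-random-XOR triple.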



E. Hidden Markov Processes II: Measure Sofic 
Process 

We now consider another hidden Markov process: the
even process, a stochastic process whose support (the
set of allowed sequences) is a sofic system called the even
system. The even system generates all binary strings
consisting of blocks of an even number of 1s bounded by
0s. Having observed a process's sequences, we say that a
word (finite sequence of symbols) is forbidden if it never
occurs. A word is an irreducible forbidden word if it
contains no proper subwords which are themselves forbidden
words. A system is sofic if its list of irreducible forbidden
words is infinite. The even system is one such sofic
system, since its set $\{01^{2n+1}0,\ n = 0, 1, \ldots\}$ of
irreducible forbidden words is infinite. Note that no
finite-order Markovian source can generate this or, for that
matter, any other strictly sofic system. The even process then
associates probabilities with each of the even system's
sequences by choosing a 0 or 1 with fair probability after
generating either a 0 or a pair of 1s. The result is a
measure sofic process: a distribution over a sofic system's
sequences. Like the RRXOR process, the even process is
not Markovian, but a hidden Markov process.

The various entropy convergence curves for the even
process are shown in Fig. 11. The entropy rate of the
even process is $h_\mu = 2/3$ bits per symbol and the
predictability G is 1/3 bits per symbol. Note that these
values are the same as those for the RRXOR and GM
processes, again emphasizing the poverty of $h_\mu$ as a
structural measure. The convergence of $h_\mu(L)$ is
exponential. A fit to

    $h_\mu(L) - h_\mu \approx A\,2^{-\gamma L}$ ,    (81)

shown in Fig. 12, yields A = 0.388 ± 0.019 and
γ = 0.501 ± 0.007. We find that E ≈ 0.902 bits. This is the
amount of storage required on average to hold the
information that a given observed 1 is the "even" or "odd"
symbol in a block of 1s. The transient information is
T ≈ 3.03 bit-symbols: The even process is moderately
difficult to synchronize to, although it is much easier to
synchronize to than the RRXOR process in the previous
example. Since H(1) ≈ 0.918, the exponential-decay
estimates of E and T are approximately 0.86 bits and
2.92 bit-symbols, respectively, both of which agree well
with the values measured for E and T.

Again, the predictability gain per symbol $\Delta^2 H(L)$,
shown in Fig. 11(c), oscillates as it converges to zero. The
plot indicates that odd-length measurement sequences
are more informative than even-length ones. As in the
RRXOR example, the oscillation of $\Delta^2 H(L)$ provides a
strong hint as to the underlying structure of the hidden
Markov process responsible for the observed sequences.
This process has two states and, thus, a strong period-2
component. This periodic behavior in the hidden states
is picked up in $\Delta^2 H(L)$, despite the fact that
$\Delta^2 H(L)$ is based only on the statistics of the observed,
nonhidden symbols.

FIG. 11. (a) Entropy growth (solid line) for the even
process, along with the asymptote E + $h_\mu L$ (dashed line).
(b) Entropy convergence, both $h_\mu(L)$ (solid line) and
$h'_\mu(L)$ (dashed line), for the same. (c) Predictability gain
$\Delta^2 H(L)$ versus sequence length.
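The odd/even oscillation in the predictability gain of the even process already shows up at the shortest block lengths. The sketch below assumes a two-state hidden Markov presentation written out by hand (`T_SYM`, `PI`) and takes h_mu(0) = log2 2 = 1; these names and conventions are illustrative assumptions, not taken from the text.

```python
from itertools import product
from math import log2

import numpy as np

# Assumed two-state presentation of the even process.
# State 0: an even number of 1s has been emitted since the last 0.
# State 1: an odd number of 1s has been emitted, so a 1 must follow.
T_SYM = {
    "0": np.array([[0.5, 0.0], [0.0, 0.0]]),
    "1": np.array([[0.0, 0.5], [1.0, 0.0]]),
}
PI = np.array([2 / 3, 1 / 3])  # stationary distribution over the hidden states


def word_probability(word: str) -> float:
    """Probability of observing `word`, starting from the stationary state."""
    v = PI
    for s in word:
        v = v @ T_SYM[s]
    return float(v.sum())


def block_entropy(L: int) -> float:
    probs = (word_probability("".join(w)) for w in product("01", repeat=L))
    return -sum(p * log2(p) for p in probs if p > 0)


H = [block_entropy(L) for L in range(8)]
h = [1.0] + [H[L] - H[L - 1] for L in range(1, 8)]      # h_mu(L); h_mu(0) assumed 1
gain = [None] + [h[L] - h[L - 1] for L in range(1, 8)]  # Delta^2 H(L)

print(word_probability("010"), word_probability("01110"))  # forbidden words
print([round(g, 4) for g in gain[1:]])
```

The forbidden words 010 and 01110 receive probability zero, and the magnitude of the gain at L = 3 exceeds that at L = 2, in line with the observation that odd-length measurement sequences are the more informative ones.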




F. Hidden Markov Processes III: The Simple 
Nondeterministic Source 



We now consider a process known as the simple
nondeterministic source (SNS). This process was constructed to
illustrate how measurement distortion can contribute its
own kind of apparent structural complexity to a simple,
but hidden, information source. In particular, the SNS
describes the process obtained via a non-generating
partition of the logistic map [55]. For an introduction to issues
of measurement- induced complexity see Ref. fssl , and for 
a full mathematical treatment see Ref. . Spatial ver- 
sions of this class of hidden process were introduced in 
Ref. Ql and analyzed from a computation theoretic view 
in Ref. ill. 
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FIG. 12. A least-squares fit (dashed line) to the exponen- 
tial decay of /i^ (L) (squares) for the even process. 
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FIG. 13. (a) Entropy growth (solid line) for the simple 
nondeterministic source, along with the asymptote E + $h_\mu L$
(dashed line). (b) Entropy convergence, both $h_\mu(L)$ (solid
line) and $h'_\mu(L)$ (dashed line), for the same.
(c) Predictability gain $\Delta^2 H(L)$ versus sequence length.
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The SNS, a hidden Markov process, generates symbol 
sequences as follows. The system has three internal, hidden
states: A, B, and C. The observer, however, only
sees the binary outputs 0 and 1. The probabilities of
generating the observed symbols, when the process is in
each of the internal states, are given by the transition
matrices $T^{(0)}$ and $T^{(1)}$, respectively:



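[Eqs. (82) and (83): the 3 × 3 matrices $T^{(0)}$ and $T^{(1)}$
over the states {A, B, C}. The surviving entries are all 1/2;
their row and column positions are not recoverable here.]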



The elements of the transition matrices are identified
with the set of internal states {A, B, C} in the natural
way. For example, $T^{(1)}_{BC} = 1/2$ indicates that the
probability of being in state B, producing a 1, and making a
transition to state C is 1/2.

Assuming that the observer knows the internal structure
of the process (i.e., $T^{(0)}$ and $T^{(1)}$), then whenever
a 1 is measured the observer knows that the internal
state is C. However, for every 0 measured after this,
the observer becomes and then remains uncertain as to
whether the internal state is A or B. This also explains
the label "nondeterministic" for this process: the
measurement of 0 does not determine the internal state. In
contrast, all the previous examples we have considered
have been deterministic, in the sense that specifying the
output symbol determines the next internal state.

A central consequence of this nondeterminism is that 
the number of effective states seen by an observer that 
attempts to reconstruct the hidden process is infinite, 
even though the internal process is a simple, three-state 
Markov chain |5^,|5^. The SNS is arguably one of the 
simplest such examples for which this infinite-state di- 
vergence occurs. 
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FIG. 15. A log-log plot to test for a power-law decay of 
$h_\mu(L)$ (squares). The latter are calculated exactly for
sequences from L = 2 to L = 25. The dashed line represents a
power-law decay.

The various entropy convergence curves for the SNS
process are shown in Fig. 13. The entropy rate, calculable
analytically, is $h_\mu \approx 0.6778$ bits per symbol and
the predictability is G ≈ 0.3222 bits per symbol. We
find that E ≈ 0.147 bits (there is not much mutual
information between the past and future) and T ≈ 0.175
bit-symbols.

Interestingly, the functional form of $h_\mu(L) - h_\mu$ is not
clear. An exponential decay

    $h_\mu(L) - h_\mu \approx A\,2^{-\gamma L}$    (84)

is shown as the dashed line, with A = 0.05 and γ = 1.35,
in Fig. 14. One can also test a power-law entropy decay
of the form

    $h_\mu(L) - h_\mu = c\,L^{-\alpha}$ .    (85)
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FIG. 14. A semilog plot to test for an exponential decay 
of hi_i{L) (squares). The latter are calculated exactly for se- 
quences from L = 2 to L = 25. The dashed line represents an 
exponential decay. 



This is shown as the dashed line, with c = 1.0 and
α = 7.0, in Fig. 15. Neither form is ideal: entropy
convergence is slower than exponential and faster than a
power law. Based on Figs. 14 and 15 one cannot infer a simple
functional form for $h_\mu(L) - h_\mu$; perhaps it is some version
of a stretched exponential.

In short, the simple nondeterministic source has low
predictability and low apparent memory. Moreover, since
T is small, synchronizing to it entails overcoming very
little uncertainty. These would seem to be in accord with
the fact that one can write down a compact nondeterministic
representation for it that has only a few hidden
states. However, to perform optimal prediction, a
deterministic representation is needed and for the SNS that
representation has an infinite number of states. This
degree of complexity is not suggested by the relatively
small values for the information theoretic measures of
structure considered here. Thus, relying only on
information theoretic quantities, one is misled as to the
process's actual complexity. Nonetheless, the fact that
entropy convergence is not clearly exponential, in contrast
to the even and RRXOR processes, provides indirect
evidence that the SNS is different from these other finitary
sources.



G. Aperiodic Infinitary Process 

We now consider an infinite-memory process that is
aperiodic and has zero entropy rate. The Thue-Morse
(TM) sequence is the fixed point of the substitution σ
defined by:

    $\sigma(0) = 01$ ,    (86)
    $\sigma(1) = 10$ .    (87)

For example, starting from the initial string s = 1, the
fifth iterate in the TM sequence is:

    $\sigma^5(s) = 10010110011010010110100110010110$ .



The Thue-Morse language $\mathcal{L}_{TM}$ is the set of all
words occurring in the TM sequence:

    $\mathcal{L}_{TM} = \mathrm{sub}\left( \lim_{n \to \infty} \sigma^n(1) \right)$ ,    (89)

where sub(s) gives all of the subwords in string s. The
Thue-Morse process is then given by assigning the natural
measure (the frequency of occurrence in $\sigma^\infty(1)$)
to the words in $\mathcal{L}_{TM}$. Unlike the previous three
examples we have considered, the Thue-Morse process is
not generated by a finitary hidden Markov process. In
fact, there is no finite-state process that can generate the
Thue-Morse sequence.
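The substitution and the natural measure are straightforward to explore numerically. The following sketch iterates σ and estimates block entropies from word frequencies in a long finite iterate. It is an approximation under stated assumptions: frequencies in a finite prefix stand in for the natural measure, and the helper names `iterate` and `block_entropy` are introduced here for illustration.

```python
from collections import Counter
from math import log2

SIGMA = {"0": "01", "1": "10"}  # the Thue-Morse substitution


def iterate(s: str, n: int) -> str:
    """Apply the substitution n times to the string s."""
    for _ in range(n):
        s = "".join(SIGMA[c] for c in s)
    return s


def block_entropy(seq: str, L: int) -> float:
    """Empirical H(L) in bits from sliding-window word counts in seq."""
    if L == 0:
        return 0.0
    n = len(seq) - L + 1
    counts = Counter(seq[i:i + L] for i in range(n))
    return -sum((c / n) * log2(c / n) for c in counts.values())


tm = iterate("1", 14)                      # 2**14 = 16384 symbols
H = [block_entropy(tm, L) for L in range(13)]
h = [H[L] - H[L - 1] for L in range(1, 13)]  # h[0] is h_mu(1), h[1] is h_mu(2), ...

print(iterate("1", 5))
print([round(x, 3) for x in h])
```

The fifth iterate printed first should match the string given above, and the estimated $h_\mu(L)$ values fall off slowly with L rather than locking onto a constant at any finite length.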

The various entropy convergence curves for the TM
process are shown in Figs. 16 and 17. These curves were
calculated using the results of Ref. [58], which show that:

    $h_\mu(1) = 1$ ,    (90)
    $h_\mu(2) = \log_2 3 - \frac{2}{3}$ ,    (91)
    $h_\mu(3) = \frac{2}{3}$ ,    (92)

and, for $k \geq 1$:

    $h_\mu(L) = \begin{cases} \dfrac{4}{3 \cdot 2^k} , & \text{if } 2^k + 1 \leq L - 1 \leq 3 \cdot 2^{k-1} \\[1ex] \dfrac{2}{3 \cdot 2^k} , & \text{if } 3 \cdot 2^{k-1} + 1 \leq L - 1 \leq 2^{k+1} \end{cases}$    (93)

From this, one concludes that $h_\mu = 0$ and that the
entropy-rate estimates converge according to a power
law: $h_\mu(L) \propto 1/L$. Thus, the total entropy grows
logarithmically: $H(L) \propto \log_2 L$, as shown in Fig. 16(a).
Despite the slow convergence to $h_\mu = 0$, the predictability
is high: G = 1 bit per symbol. Each measurement gives
the maximal amount of information about the nonrandom
part of the process.
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FIG. 16. (a) Entropy growth (solid line) for the 
Thue-Morse process. (b) Entropy convergence, both $h_\mu(L)$
(solid line) and $h'_\mu(L)$ (dashed line), for the same. In (a)
and (b) sequence length goes up to L = 5000. (c) Predictability
gain $\Delta^2 H(L)$ over small ranges of L.

Nonetheless, the excess entropy diverges;
$E(L) \propto \log_2 L$, indicating an infinite-memory process.
(See Fig. 17(a).) This can also be inferred from Fig. 16(a),
where E is simply the height of the H(L) curve, since
$h_\mu = 0$. Finally, the transient information estimate T(L)
also diverges, linearly, as shown in Fig. 17(b). This
linear divergence is explained by looking at the earlier
expression for T. If one substitutes $h_\mu(L) \sim 1/L$ and
$h_\mu = 0$ into the expression there for T, the linear
divergence follows immediately.
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It is clear from Fig. |6|(c) that there are long sequences 
of measurements that are uninformative. These are punctuated
occasionally with isolated symbols that do improve
predictability. These occur at sequence lengths
$L = 3 \times 2^{i-3} + 2$, i = 3, 4, 5, .... To determine why
$\Delta^2 H(L)$ behaves in this manner requires a computation
theoretic approach, such as that given in Ref. for the 
symbolic dynamics produced at the period-doubling ac- 
cumulation point of the logistic map. For another similar 
approach, see Ref. |60|. 




FIG. 17. (a) Excess entropy estimate divergence: E(L)
(solid line) and E'(L) (dashed line). (b) Transient
information estimate divergence: T(L) (solid line). Note that
the sequence length goes up to L = 5000 and that both plots
have large vertical scales.

We conclude this section by noting that, based on our
results and those of several other authors [31, 58, 61],
this 1/L entropy convergence is typical of aperiodic
sequences generated by substitution rules like those of
Eqs. (86) and (87). Moreover, Freund, Ebeling, and Rateitschak
have given an argument for why this entropy convergence
form is characteristic of aperiodic sequences [31].



H. Other Infinitary Processes

Before concluding this section, we review the results
of several other investigations of entropy convergence.
Szépfalusy and Györgyi found that
$h_\mu(L) - h_\mu \sim L^{-\alpha}$ with α > 5/2 for a class of
one-dimensional intermittent maps. Thus, this class consists
of finitary processes. However, for a different model of an
intermittent process, Freund found a similar decay form, but
with α ≈ 0.492. Examining temporal-block sequences in
elementary one-dimensional cellular automata, Grassberger [13]
also found a power law decay, with α = 0.6 ± 0.1 for rules 30
and 45 and α = 1.0 ± 0.1 for rule 120. These are examples
of infinitary processes.

A number of researchers have examined entropy convergence
for written texts, such as The Bible, Grimms'
Tales, Moby Dick, the gnuplot manual, and Gleick's
popular book "Chaos". The picture that emerges is that
entropy convergence can be fit to a power law
$h_\mu(L) - h_\mu \sim L^{-\alpha}$ with α ranging from 0.4 to 0.6.
Interestingly, for a Beethoven sonata an exponent of
α ≈ 0.75 has been found. Again, these results indicate
infinitary processes.

Recently, Nemenman and also Bialek, Nemenman, and
Tishby have found power-law convergences for different
one-dimensional Ising models. For long-range coupling,
where the coupling constants decay as the inverse
lattice separation, they found an α of 0.5. They also
examined an Ising model with short-range interactions,
but in which the coupling constant changes every 400,000
sites within a lattice of 10^ spins. The coupling constant
was drawn from a Gaussian distribution with zero mean.
For this system they found a power-law decay with an
exponent of α = 1.



I. Summary of Examples

For comparison, Table III collects the various analytical
and numerical estimates of the information theoretic
quantities for the preceding examples.

Process              h_μ      G        γ        E          T
Fair Coin            1        0                 0          0
Biased Coin          0.881    0.119             0          0
Period-16            0        1                 4          16.6135
(11000)^∞            0        1                 log_2 5    4.073
(10000)^∞            0        1                 log_2 5    5.273
(10101)^∞            0        1                 log_2 5    4.873
Golden Mean          2/3      1/3               0.252      0.252
Even                 2/3      1/3      0.501    0.902      3.03
Random-Random-XOR    2/3      1/3      0.306    2          9.43
Nondeterministic     0.678    0.322    1.35*    0.147      0.175
Thue-Morse           0        1                 ∝ log L    ∝ L

TABLE III. Summary of Examples. *This can also be fit
to a power law, $L^{-\alpha}$ with α ≈ 7.

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VII. APPLICATIONS AND IMPLICATIONS 

Being cognizant of various types of entropy conver- 
gence, of different classes of process, and of how to 
quantitatively distinguish between them is useful gen- 
eral knowledge. To this end, we reviewed information- 
theoretic quantities, introduced a new one, the transient 
information, and put forth a unified framework for relat- 
ing them all in terms of discrete derivatives and integrals. 
Then, in the preceding section, we analyzed a number of 
examples. We return now to the set of questions posed in 
the introduction: How can we untangle different sources 
of apparent randomness? In particular, what happens to 
our estimates of the entropy rate if we ignore a process's 
structure? 

Addressing these questions is the task of this last sec- 
tion. Here we show that there are direct and empirically 
important consequences for ignoring structural proper- 
ties. We consider several different questions: 

1. What happens when an observer ignores entropy- 
rate convergence? 

2. What happens when the process's apparent mem- 
ory is ignored? 

3. On the one hand, what happens if the observer ig- 
nores synchronization? 

4. On the other hand, what happens if the observer 
assumes it is synchronized to the process? 

The answers, given below, show that ignoring a pro- 
cess's structural properties leads to a range of misleading 
inferences about randomness and organization. In addi- 
tion to highlighting the negative consequences, we also 
comment on the fact that the associated problems can 
be alleviated to some extent, even in cases where data is 
limited. 



A. Disorder as the Price of Ignorance

The first two questions are closely related and rather
straightforward to answer. The preceding sections defined
several different quantities ($h_\mu$, G, E, and T)
that measure randomness, predictability, memory,
synchronization, and other features of a process. For the
most part, these are asymptotic quantities in the sense
that they involve the behavior of the function H(L) in
the $L \to \infty$ limit. Thus, their exact empirical estimation
demands that an infinite number of measurements (for
accurate estimates of sequence probabilities) of infinitely
long sequences be made. Obviously, other than by analytic
means, it is not possible to exactly calculate such
quantities. Exact, $L \to \infty$ results are known for only a
few special systems which are analytically tractable.

This leads one to ask, even when sequence probabilities
are accurately known, how well can these various
source properties be estimated at finite L? What errors
are introduced and are these errors related in any way?

The simplest such question, the first one listed above,
arises when one attempts to estimate source randomness
$h_\mu$ via the approximation $h'_\mu(L) = H(L)/L$. Stopping the
estimate at finite L gives one a rate $h'_\mu(L)$ that is larger
than the actual rate $h_\mu$. That is, the source appears more
random if we ignore correlations between variables separated
by more than L steps. This observation follows directly from
the definitions of $h_\mu$ and $h'_\mu(L)$. However, it turns out
that this form of overestimation of $h_\mu$ is related to the
excess entropy E. We shall see that there is a quantitative
trade-off between randomness and memory.

Assume an observer makes measurements of a process
with entropy rate $h_\mu$ and excess entropy E > 0. Recall
the definition of the entropy rate as the $L \to \infty$ limit
of H(L)/L. Using this definition to estimate $h_\mu$ is
tantamount to assuming that E = 0 (see the dashed line
$h_\mu L$ in Fig. 18). But, by assumption, E > 0. Thus, at a
given L, we can ask what the entropy estimate
$h'_\mu(L) = H(L)/L$ is. Lemma 1 established that
$h'_\mu(L) \geq h_\mu$. But by how much more?
This is answered in a straightforward way by the following
proposition.

Proposition 13 When the observer is synchronized to
the process,

    $h'_\mu(L) = h_\mu + \dfrac{E}{L}$ .    (94)

Proof: The claim follows immediately from the graphical
construction given in Fig. 18. Saying that the observer is
synchronized to the source means using an L such that
H(L) = E + $h_\mu L$. Thus,

    $h'_\mu(L) \equiv \dfrac{H(L)}{L} = \dfrac{E + h_\mu L}{L}$ .    (95)

Eq. (94) follows directly. □

In this way, E bits of memory are converted into addi- 
tional, apparent randomness. The process appears more 
random due to the observer ignoring one of its structural 
properties. 
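As a worked instance of Proposition 13, one can plug in the golden mean values quoted earlier ($h_\mu = 2/3$, E ≈ 0.2516). This is a sketch that assumes the golden mean process is synchronized for all L ≥ 1, which is consistent with its being an order-one Markov chain; the particular values of L are chosen only for illustration.

```latex
% Golden mean process: h_mu = 2/3, E ~ 0.2516 bits (values quoted in the text).
\[
  h'_\mu(L) \;=\; \frac{H(L)}{L} \;=\; h_\mu + \frac{\mathbf{E}}{L}
\]
\begin{align*}
  h'_\mu(1)  &\approx 0.6667 + 0.2516      = 0.9183 \quad (= H(1)) \\
  h'_\mu(5)  &\approx 0.6667 + 0.2516/5    \approx 0.7170 \\
  h'_\mu(20) &\approx 0.6667 + 0.2516/20   \approx 0.6793
\end{align*}
```

Even for this low-memory process the apparent randomness at L = 5 is still about 0.05 bits per symbol above the true rate, and the excess decays only as 1/L.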

One can object to this estimate: Typically one does
not know the process's properties (e.g., E and $h_\mu$) and
so even these must be estimated. Thus, expressing the
estimator $h'_\mu$ in terms of the asymptotic quantities E and
$h_\mu$ may not be that useful. However, E'(L) in Eq. (61)
is a non-asymptotic, L-dependent estimator of memory.
Namely, E'(L) is a measure of the mutual information
between two halves of an L-block. Using this estimator
we can restate Proposition 13.

Proposition 14

    $h'_\mu(L) - h_\mu(L) \geq \dfrac{E'(L)}{L}$ .    (96)
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Proof: Again, see Fig. 18, Appealing to the monotonicity 
and convexity of H(L), the monotonicity of $h_\mu(L)$, and
an earlier lemma, we can rewrite the definition

    $E(L) = H(L) - h_\mu(L)\,L$ ,    (97)

as

    $\dfrac{H(L)}{L} - h_\mu(L) = \dfrac{E(L)}{L}$ .    (98)

Since E'(L) is bounded above by E(L) by an earlier lemma,
we have

    $h'_\mu(L) - h_\mu(L) \geq \dfrac{E'(L)}{L}$ ,    (99)

which directly proves the claim. □

This result establishes how $h_\mu(L)$ lower bounds $h'_\mu(L)$,
as indicated by Lemma 1. In particular, it emphasizes
that their difference is controlled by the excess entropy,
a measure of memory.

Although E is an L-asymptotic quantity, the error E/L
in the entropy-rate estimate dominates at small L. Moreover,
being restricted to small L is typical of experimental
situations with limited data or in which drift is present.
That is, one cannot reliably estimate the L-block
probabilities $\Pr(s^L)$ at large L due to the exponential growth
in their number or the nonstationarity of block
probabilities, respectively.



FIG. 18. Ignored memory is converted to randomness:
Illustration of how ignoring memory, in this case implicitly
assuming E = 0 as the entropy-rate definition implies, when
actually E > 0, leads to an overestimate $h'_\mu(L)$ of the
actual entropy rate $h_\mu$.



B. Predictability and Instantaneous Synchronization

Conversely, if one assumes a fixed amount of memory
E, we shall see that this leads to an underestimate of
the entropy rate $h_\mu$. Assuming a fixed excess entropy
is not something that one would be likely to do in the
particular setting here, in which an observer empirically
measures entropy density and related quantities from
observed symbol sequences. In a more general modeling
setting, however, one always runs the risk of over-fitting and,
in so doing, "projecting" some particular structure, such
as additional memory capacity, onto the system. Assuming
a fixed, nonzero value for the excess entropy is,
in an abstract sense, an example of over-fitting. Given
this, we ask: What is the consequence of assuming a fixed
value for E?

Equivalently, what happens if the observer assumes
that it is synchronized to the process at some finite L,
implying that H(L) = E + $\hat{h}_\mu L$? The geometric
construction for this scenario is given in Fig. 19. In effect
the source is erroneously considered to be a completely
observable Markovian process in which, as we have seen,
H(L) converges to its asymptotic form exactly at some
finite L. If the observer then uses Eq. (57) to estimate
$h_\mu$ using its assumed value for E, one arrives at the
estimator $\hat{h}_\mu$, where

    $\hat{h}_\mu = \dfrac{H(L) - E}{L}$ .    (100)

At a given L the effect is that the observer considers the
source to have a larger E than it actually has at that
L. The line E + $\hat{h}_\mu L$ is fixed at E when that intercept
should be lower. The result, easily gleaned from Fig. 19,
is that the entropy rate $h_\mu$ is underestimated as
$\hat{h}_\mu$. In other words, the source appears more predictable
than it actually is.

Proposition 15 An observer monitors a process with
excess entropy E > 0. If the observer assumes it is
synchronized when it is not, then

    $\hat{h}_\mu < h_\mu$ .    (101)

Proof: From Fig. 19 or Eq. (100), one sees that

    $\hat{h}_\mu = \dfrac{H(L) - E}{L}$ .    (102)

The observer is assuming that it is seeing
H(L) = E + $\hat{h}_\mu L$. But since H(L) < E + $h_\mu L$, we have
that

    $E + \hat{h}_\mu L < E + h_\mu L$ ,    (103)

and so $\hat{h}_\mu < h_\mu$. □
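Proposition 15 can be illustrated numerically with the RRXOR process, for which the text quotes E = 2 bits and $h_\mu = 2/3$. The sketch below assumes the three phases of the random-random-XOR cycle are equally likely and computes the estimator of Eq. (100) at small L; the function name `rrxor_block_entropy` and the constants `E_ASSUMED` and `H_MU` are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2


def rrxor_block_entropy(L: int) -> float:
    """Exact H(L) in bits for RRXOR, averaging over the three cycle phases."""
    n_blocks = L // 3 + 2
    n_random = 2 * n_blocks
    weight = 1.0 / (3 * 2 ** n_random)
    probs = defaultdict(float)
    for bits in product((0, 1), repeat=n_random):
        seq = []
        for k in range(n_blocks):
            a, b = bits[2 * k], bits[2 * k + 1]
            seq.extend((a, b, a ^ b))  # two random symbols, then their XOR
        for phase in range(3):
            probs[tuple(seq[phase:phase + L])] += weight
    return -sum(p * log2(p) for p in probs.values())


E_ASSUMED = 2.0   # the observer fixes E at its asymptotic value
H_MU = 2 / 3      # true entropy rate quoted in the text

# Estimator of Eq. (100): (H(L) - E) / L
estimates = {L: (rrxor_block_entropy(L) - E_ASSUMED) / L for L in range(1, 9)}

for L, est in estimates.items():
    print(L, round(est, 4), est < H_MU)
```

At very small L the estimate is even negative, an extreme case of the source appearing more predictable than it is.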
FIG. 19. Assumed synchronization converted to false
predictability: Schematic illustration of how assuming one is
synchronized to a process leads to an underestimate $\hat{h}_\mu$ for a
source with excess entropy $E > 0$ and entropy rate $h_\mu$.

C. Assumed Synchronization Implies Reduced
Apparent Memory

In addition to analyzing the effects on the apparent
entropy rate due to assuming synchronization, we can ask
a complementary question: What are the effects on
estimates of the apparent memory E? Figure 20 illustrates
this situation.

FIG. 20. Assumed synchronization leads to less apparent
memory: Schematic illustration of how assuming
synchronization to a source, in this case implicitly assuming
$H(L) = \hat{E} + h_\mu L$, leads to an underestimate $\hat{E}$ of the actual
memory $E > 0$.

If, at a given L, we approximate the entropy rate
estimate $H(L) - H(L-1)$ by the true entropy rate $h_\mu$, then the
offset between the asymptote and H(L) is simply

$$ \Delta E = E + h_\mu L - H(L) . \tag{104} $$

Translating this back to the origin we have a reduced
apparent memory $\hat{E} < E$ of

$$ \hat{E} = H(L) - h_\mu L . \tag{105} $$

In fact, since the estimated entropy rate is larger than
$h_\mu$, the reduction in apparent memory is even larger.

Thus, assuming synchronization, in the sense that
$h_\mu(L) = h_\mu$, leads one to underestimate the apparent
memory, as measured by the excess entropy E.
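
The two effects described in this section, Eqs. (100) and (105), can be checked numerically. The following is a minimal sketch, assuming the even process written as a two-state hidden Markov source. The names `TRANSITIONS`, `PI`, `word_probability`, and `block_entropy` are introduced here for illustration only, and the values $h_\mu = 2/3$ and $E = \log_2 3 - 2/3$ bits for this process are taken as given rather than derived.

```python
import itertools
import math

# Even process as a two-state hidden Markov source (states A and B):
# A emits 0 with prob. 1/2 (stays in A) or 1 with prob. 1/2 (goes to B);
# B emits 1 with prob. 1 and returns to A.
TRANSITIONS = {
    0: [[0.5, 0.0], [0.0, 0.0]],
    1: [[0.0, 0.5], [1.0, 0.0]],
}
PI = [2 / 3, 1 / 3]            # stationary state distribution
H_MU = 2 / 3                   # entropy rate in bits per symbol (assumed known)
E = math.log2(3) - 2 / 3       # excess entropy in bits (assumed known)

def word_probability(word):
    """Pr(word): push the state distribution through the labeled matrices."""
    v = PI
    for s in word:
        v = [sum(v[i] * TRANSITIONS[s][i][j] for i in range(2)) for j in range(2)]
    return sum(v)

def block_entropy(L):
    """H(L) in bits, by enumerating all words of length L."""
    probs = (word_probability(w) for w in itertools.product(TRANSITIONS, repeat=L))
    return -sum(p * math.log2(p) for p in probs if p > 0)

lengths = range(1, 11)
H = {L: block_entropy(L) for L in lengths}
h_hat = {L: (H[L] - E) / L for L in lengths}    # Eq. (100): E assumed, rate estimated
E_hat = {L: H[L] - H_MU * L for L in lengths}   # Eq. (105): rate assumed, memory estimated

for L in lengths:
    print(f"L={L:2d}  H={H[L]:.4f}  h_hat={h_hat[L]:.4f}  E_hat={E_hat[L]:.4f}")
```

At every finite L in this range the estimate `h_hat[L]` stays below `H_MU` and `E_hat[L]` stays below `E`, which is the behavior Proposition 15 and Eq. (105) describe.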



VIII. CONCLUSION 

Looking back, we have introduced a variety of information
theoretic measures of a process's randomness and a
variety of structural properties. Along the way, we put
forth a new quantity, the transient information T. One
of the central results of this work is contained in Theorem
1, where we proved that T is related to the total
state-uncertainty experienced while synchronizing to a Markov
process.

We also calculated these information theoretic quanti- 
ties for a range of differently structured processes. A nat- 
ural question, then, is: To what extent does this informa- 
tion theoretic approach allow us to distinguish between 
processes that are structured in fundamentally different 
ways? 



A. Process Classification 



To summarize our results from Section VI, we now
give a rough classification of several types of information
source based on the quantities studied here. Similar,
although coarser, classifications have been put forth by
Szépfalusy [||, Ebeling (SS), and Crutchfield ||.

First, we have the zero entropy rate, asymptotically 
predictable processes. 

1. Periodic processes: For period-p processes, H(L)
becomes a constant and $h_\mu(L)$ vanishes for L > p.

2. Aperiodic processes: These are infinitary processes,
since they need, in a crude sense, an infinite amount
of memory to maintain their aperiodicity. Having
$h_\mu = 0$, they cannot be aperiodic by virtue of an
internal source of randomness. T diverges, indicating
that one is never fully synchronized.

Then we have the positive entropy rate, irreducibly 
unpredictable processes. 



1. Memoryless processes: For these, H(L) scales as
$h_\mu L$ and $h_\mu(L)$ converges immediately to $h_\mu$. We
have E = 0 and T = 0. Independent, identically
distributed (IID) processes are examples of
this class. They have no temporal memory and no
structural complexity.
2. Finitary processes: In this class H(L) scales as
$E + h_\mu L$. The entropy density $h_\mu(L)$ typically
converges exponentially to $h_\mu$. We have $0 < E < \infty$
and T > 0. Ref. ||5^ established a useful connection
between information and ergodic theories for
this class: finite E means that a process is weak
Bernoulli. Within the finitary class further structural
distinctions are possible:

(a) Markov processes: The basic property of 
Markovian sources is that one synchronizes to 
them exactly at some finite block length L. 
For these processes, the effective states can be 
taken to be single symbols or symbol blocks 
of some finite length. Once that length of se- 
quence has been parsed, the observer is syn- 
chronized and can then optimally predict the 
process. 

(b) Deterministic hidden Markov processes: 
These processes are characterized by an ex- 
ponential convergence of h^{L), in contrast 
to the exact convergence at finite L exhib- 
ited by a Markov process. Depending on 
the transition structure of the hidden states, 
these processes can have relatively large val- 
ues for the excess entropy and transient in- 
formation. Within this broad class of hidden 
Markov processes lies the interesting case of 
a measure sofic process — a system whose 
support set contains an infinite list of irre- 
ducible forbidden words. In a limited sense, 
these systems have an infinite memory that 
keeps track of the (infinite) list of irreducible 
forbidden words. Nevertheless, the measure
sofic process considered here, the even process,
had finite E and T. As noted above, the
behavior of $\Delta^2 H(L)$ for these processes seems
to provide strong hints of the structure of the
hidden state transitions responsible for the
infinite memory. In particular, we find that
$\Delta^2 H(L)$ oscillates with a periodicity given
by the periodic structure of the transitions
between hidden states.

(c) Nondeterministic hidden Markov processes: It 
would appear that this class of process may 
not be overtly different from other finitary hid- 
den Markov processes. However, the exam- 
ple we considered, the simple nondeterministic 
source, showed a markedly different entropy 
convergence behavior than the other hidden 
Markov examples. 

3. Infinitary Sources: At this point in time, this
remains a catch-all category of processes: those
falling outside the finitary classes. These include,
for example, various context-free languages, such as
positive-entropy-rate variations on the Thue-Morse
process and other stochastic analogues from higher
up the Chomsky hierarchy. Presumably, within the
infinitary sources there are many interesting
structural distinctions waiting to be discovered; some
analogous to the automata-architectural distinctions
recognized by discrete computation theory
[65] and some distinctions related to the nature of
the measure over the infinite sequences.
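
The signatures used in this classification can be read off directly from the block entropies. The following is a minimal sketch under stated assumptions: the three sources (a period-2 process, a fair coin, and the golden mean process with no consecutive 0s) are illustrative choices written as small hidden Markov sources, and `block_entropies`, `summarize`, and `SOURCES` are hypothetical names introduced here. The estimates use the slope and intercept of H(L) at a finite `max_len`, which is exact only because these particular examples converge at small L.

```python
import itertools
import math

def block_entropies(transitions, pi, max_len):
    """Return [H(0), ..., H(max_len)] in bits for a hidden Markov source."""
    n = len(pi)
    H = []
    for L in range(max_len + 1):
        total = 0.0
        for word in itertools.product(transitions, repeat=L):
            v = pi
            for s in word:
                v = [sum(v[i] * transitions[s][i][j] for i in range(n)) for j in range(n)]
            p = sum(v)
            if p > 0:
                total -= p * math.log2(p)
        H.append(total)
    return H

def summarize(transitions, pi, max_len=8):
    """Finite-L estimates of (h_mu, E, T) from the block entropies."""
    H = block_entropies(transitions, pi, max_len)
    h = H[-1] - H[-2]                      # slope of H(L) at the largest L
    E = H[-1] - max_len * h                # intercept of the line through H(max_len)
    T = sum(E + h * L - H[L] for L in range(max_len + 1))  # gap between line and H(L)
    return h, E, T

# Each entry: (symbol-labeled transition matrices, stationary state distribution).
SOURCES = {
    "period-2": ({0: [[0, 1], [0, 0]], 1: [[0, 0], [1, 0]]}, [0.5, 0.5]),
    "fair coin": ({0: [[0.5]], 1: [[0.5]]}, [1.0]),
    "golden mean": ({0: [[0, 0.5], [0, 0]], 1: [[0.5, 0], [1, 0]]}, [2 / 3, 1 / 3]),
}

results = {name: summarize(tr, pi) for name, (tr, pi) in SOURCES.items()}
for name, (h, E, T) in results.items():
    print(f"{name:12s} h_mu={h:.4f}  E={E:.4f}  T={T:.4f}")
```

In this sketch the periodic source shows zero entropy rate with nonzero E, the fair coin shows E = T = 0, and the Markov example shows positive entropy rate with finite, nonzero E and T, matching the classes listed above.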

The ultimate goal of this type of classification would 
be an amalgamation of the structural distinctions made 
in the Chomsky hierarchy of computation theory p5[ and 
statistical categories found in the ergodic theory hierar- 
chy of stochastic processes | |6^. 

Recent work by Nemenman M and Bialek, Nemenman, 
and Tishby H may be a helpful step in this direction. In 
Refs. 1^,^ they show that the excess entropy — the "pre- 
dictive information" in their parlance — is, in some cir- 
cumstances, related to the number of parameters in the 
model producing the process. However, this result holds 
in a slightly different context than ours. Rather than using
histograms of larger and larger variable blocks, they
consider a procedure in which an observer is trying to
learn a distribution through successive samplings.



B. Inferring Models from Finite Resources 



In Section VII we considered various trade-offs between
finite-L estimates of the excess entropy E, the transient
information T, and the entropy rate $h_\mu$. In particular,
we have shown that not taking one or another into
account leads one to systematically over- or underestimate
a source's entropy rate $h_\mu$. For example, there can be an
inadvertent conversion of ignored memory into apparent
randomness. The magnitude of this effect is proportional
to the difference between source memory and the upper
bound on memory that the observer can estimate. In
a complementary way, one can inadvertently convert
assumed memory into false predictability. One eventually
comes to see that a process's structural features must be
accounted for, even if one's focus is only on an apparently
simpler question of (say) how random a process is [67].



C. Future Directions 



We conclude by mentioning some important open ques- 
tions and suggesting several directions for future re- 
search. First, at a number of points we have referred 
to "structure", without actually defining it. Is there a 
better, more systematic, and principled approach for de- 
termining the structure of an information source than the 
pure information-theoretic one just outlined? Refs. p^ ] 
and psf , for example, argue that computational mechan- 
ics is a viable approach to quantifying source structure 
and the patterns produced by information sources. They 
show that the ε-machine representation used there
captures all of a source's structure. Thus, one natural
question is how one can determine entropy convergence
behavior given a process's ε-machine.

Second, it would be helpful to make a direct connection 
between the source characterization developed here — in 
terms of average source properties measured by $h_\mu$, E,
T, and G — and the difficulty of estimating these quan- 
tities and of inferring models of the sources. Analyzing 
the computational complexity of these two problems is 
the domain of computational learning theory [^,^ . 

Third, establishing that the source entropy rate $h_\mu$ is
a metric invariant is one of the hallmarks of ergodic and 
dynamical systems theories ||70|-[7^. What status do E 
and T hold in the same setting? 

Finally, there is, of course, the question of how the 
information theoretic approach to structure outlined 
here can be extended to more than one dimension. 
There has been some preliminary work in this direction 
lOI'^ 52 T^ ItgII; however, many questions remain. One 
of the central difficulties is that, unlike in one dimension 
where the various expressions for the excess entropy are 
equivalent, they yield different results when extended to 
two dimensions [77]. Careful definitions and distinct
interpretations of the different forms of two-dimensional
excess entropy and related quantities will have to be given
in order to develop a useful, fully two-dimensional ap- 
proach to pattern and structure. Our hope is that the 
preceding development is sufficiently clear and thorough 
that it can serve as a firm foundation for an information 
theory of structure in higher-dimensional processes. 
APPENDIX A: PROOFS
1. Prop. 1

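Proposition 1: $\Delta H(L) = -\mathcal{D}[\Pr(s^L) \,\|\, \Pr(s^{L-1})]$.
Proof. By direct calculation we have the following.

$$ -\mathcal{D}[\Pr(s^L) \,\|\, \Pr(s^{L-1})] = -\sum_{s^L} \Pr(s^L) \log_2 \frac{\Pr(s^L)}{\Pr(s^{L-1})} \tag{A1} $$

$$ = -\sum_{s^L} \Pr(s^L) \log_2 \Pr(s^L) + \sum_{s^{L-1}} \sum_{s_{L-1}} \Pr(s^L) \log_2 \Pr(s^{L-1}) \tag{A2} $$

$$ = H(L) + \sum_{s^{L-1}} \log_2 \Pr(s^{L-1}) \sum_{s_{L-1}} \Pr(s^L) \tag{A3} $$

$$ = H(L) - H(L-1) , \tag{A4} $$

since $\Pr(s^{L-1}) = \sum_{s_{L-1}} \Pr(s^L)$. □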



2. Prop, g 

Proposition ^ 

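$\Delta^2 H(L) = -\mathcal{D}[\Pr(s_{L-1}|s^{L-1}) \,\|\, \Pr(s_{L-2}|s^{L-2})]$.

Proof: By the expressions for the second discrete
derivative, Eq. (pO|) and Eq. (A1), we have:

$$ \Delta^2 H(L) = \Delta H(L) - \Delta H(L-1) \tag{A5} $$

$$ = -\sum_{s^L} \Pr(s^L) \log_2 \Pr(s_{L-1}|s^{L-1}) + \sum_{s^{L-1}} \Pr(s^{L-1}) \log_2 \Pr(s_{L-2}|s^{L-2}) \tag{A6} $$

$$ = -\sum_{s^L} \Pr(s^L) \log_2 \frac{\Pr(s_{L-1}|s^{L-1})}{\Pr(s_{L-2}|s^{L-2})} \tag{A7} $$

$$ = -\mathcal{D}[\Pr(s_{L-1}|s^{L-1}) \,\|\, \Pr(s_{L-2}|s^{L-2})] . \tag{A8} $$

□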



3. Prop. 



Proposition ||: $G = -R$.

Proof: We write the sum of Eq. (^J) and use the
anti-differentiation formula Eq. (O) to get:

$$ \sum_{L=1}^{M} \Delta^2 H(L) = \Delta H(M) - \Delta H(0) . \tag{A9} $$

Since $\lim_{L \to \infty} \Delta H(L) = h_\mu$ and since we have defined
$\Delta H(0) = \log_2 |\mathcal{A}|$, it follows immediately that

$$ -G = \lim_{M \to \infty} [\Delta H(0) - \Delta H(M)] \tag{A10} $$

$$ = \log_2 |\mathcal{A}| - h_\mu , \tag{A11} $$

which is R by Eq. (|l|). □
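4. Prop. §

Proposition g: $G = -\sum_{L=2}^{\infty} (L-1) \Delta^3 H(L)$.
Proof: We write Eq. (^) as a partial sum as follows:

$$ \sum_{L=2}^{\infty} (L-1) \Delta^3 H(L) = \lim_{M \to \infty} \left[ \sum_{L=2}^{M} L \Delta^3 H(L) - \sum_{L=2}^{M} \Delta^3 H(L) \right] . \tag{A12} $$

We use the integration-by-parts formula, Eq. (22), and the
anti-differentiation formula on the first and second terms
on the right-hand side and obtain, after simplifying:

$$ \sum_{L=2}^{\infty} (L-1) \Delta^3 H(L) = \lim_{M \to \infty} \left[ M \Delta^2 H(M) - \sum_{L=1}^{M} \Delta^2 H(L) \right] . \tag{A13} $$

From the definition of G, Eq. (Q), and since we assume
that G is finite, $\lim_{M \to \infty} M \Delta^2 H(M) = 0$. From this we
see immediately that

$$ -\sum_{L=2}^{\infty} (L-1) \Delta^3 H(L) = \sum_{L=1}^{\infty} \Delta^2 H(L) = G . \tag{A14} $$

□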



5. Prop. I 

Proposition |: $E = -\sum_{L=2}^{\infty} (L-1) \Delta^2 H(L)$.

Proof: Writing the right-hand side of the above equation
as a partial sum, and then using the integration-by-parts
formula Eq. (22) we obtain, after some algebra:

$$ -\sum_{L=2}^{\infty} (L-1) \Delta^2 H(L) = \lim_{M \to \infty} \left[ -M \Delta H(M) + \sum_{L=1}^{M} \Delta H(L) \right] . \tag{A15} $$

Recalling that $\Delta H(L) = h_\mu(L)$ and that $h_\mu(M) \to h_\mu$ in
the $M \to \infty$ limit, we see at once that

$$ -\sum_{L=2}^{\infty} (L-1) \Delta^2 H(L) = \sum_{L=1}^{\infty} [h_\mu(L) - h_\mu] = E . \tag{A16} $$

The last equality follows from the definition of E,
Eq. (H). □



6. Prop. 

Proposition |^: $E = \lim_{L \to \infty} [H(L) - h_\mu L]$.

Proof. Writing out the partial sum of the infinite sum in
Eq. (Q) and evaluating it using the integration formula,

$$ \sum_{L=1}^{M} [\Delta H(L) - h_\mu] = H(M) - H(0) - h_\mu M . \tag{A17} $$

Since H(0) = 0, it then follows immediately that

$$ E = \lim_{M \to \infty} [H(M) - h_\mu M] . \tag{A18} $$

Since, by Eq. ^ the left-hand side is E, the proof is
complete. □
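
The identities for G and E proved in the preceding subsections can be spot-checked numerically. The following is a minimal sketch, assuming the golden mean process (an order-1 Markov source with no consecutive 0s) as a hypothetical test case; `TRANSITIONS`, `PI`, `block_entropy`, and the cutoff `M` are names introduced here for illustration. Because this source is order-1 Markov, the finite sums up to `M` already equal their limits up to rounding.

```python
import itertools
import math

# Golden mean process as a two-state source: from state A emit 1 (stay) or 0 (go to B),
# each with prob. 1/2; from state B emit 1 and return to A.
TRANSITIONS = {0: [[0.0, 0.5], [0.0, 0.0]], 1: [[0.5, 0.0], [1.0, 0.0]]}
PI = [2 / 3, 1 / 3]
ALPHABET_SIZE = 2
M = 10  # largest block length used in the partial sums

def block_entropy(L):
    """H(L) in bits, by enumerating all words of length L."""
    total = 0.0
    for word in itertools.product(TRANSITIONS, repeat=L):
        v = PI
        for s in word:
            v = [sum(v[i] * TRANSITIONS[s][i][j] for i in range(2)) for j in range(2)]
        p = sum(v)
        if p > 0:
            total -= p * math.log2(p)
    return total

H = [block_entropy(L) for L in range(M + 1)]
# First discrete derivative, with the convention Delta H(0) = log2 |A|.
dH = [math.log2(ALPHABET_SIZE)] + [H[L] - H[L - 1] for L in range(1, M + 1)]
# Second discrete derivative, defined for L >= 1.
d2H = {L: dH[L] - dH[L - 1] for L in range(1, M + 1)}

h_mu = dH[M]                                   # exact here: the source is order-1 Markov
R = math.log2(ALPHABET_SIZE) - h_mu            # redundancy
G = sum(d2H[L] for L in range(1, M + 1))       # partial sum of Delta^2 H(L), cf. Eq. (A9)
E_sum = -sum((L - 1) * d2H[L] for L in range(2, M + 1))   # cf. Eq. (A16)
E_lim = H[M] - h_mu * M                        # cf. Eq. (A18)

print(f"G = {G:.6f}, -R = {-R:.6f}")
print(f"E (sum form) = {E_sum:.6f}, E (limit form) = {E_lim:.6f}")
```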



7. Prop. 



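Proposition 53: $E = I[\overleftarrow{S}; \overrightarrow{S}]$.

Proof. We rewrite the definition so that we can use the
finite-L forms of various entropies:

$$ I[\overleftarrow{S}; \overrightarrow{S}] = \lim_{L \to \infty} I[\overleftarrow{S}^L; \overrightarrow{S}^L] . \tag{A19} $$

We begin with the definition of mutual information,
Eq. (^), which expresses I as the difference between two
entropies:

$$ I[\overleftarrow{S}^L; \overrightarrow{S}^L] = H[\overrightarrow{S}^L] - H[\overrightarrow{S}^L | \overleftarrow{S}^L] . \tag{A20} $$

Recall that $H[\overrightarrow{S}^L] = H(L)$.

Using the conditional entropy chain rule we have

$$ H[\overrightarrow{S}^L | \overleftarrow{S}^L] = H[S_0, S_1, \ldots, S_{L-1} | S_{-L}, \ldots, S_{-1}] \tag{A21} $$

$$ = \sum_{i=0}^{L-1} H[S_i | S_{-L} S_{-L+1} \cdots S_{i-1}] . \tag{A22} $$

Putting these together we have

$$ I[\overleftarrow{S}; \overrightarrow{S}] = \lim_{L \to \infty} \left[ H(L) - \sum_{i=0}^{L-1} H[S_i | S_{-L} S_{-L+1} \cdots S_{i-1}] \right] . \tag{A23} $$

In the $L \to \infty$ limit, each term in the summand is equal
to $h_\mu$. Thus, we see that

$$ I[\overleftarrow{S}; \overrightarrow{S}] = \lim_{L \to \infty} [H(L) - L h_\mu] , \tag{A24} $$

which is E by Prop. 0. □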
8. Lemma |^
Lemma 0: $h'_\mu(L) \ge h_\mu(L) \ge h_\mu$.

Proof: We prove the right inequality first. Since
conditioning reduces entropy,

$$ h_\mu(L) \le h_\mu(L') , \quad \forall L > L' . \tag{A25} $$

Now, recall that

$$ \lim_{L \to \infty} h_\mu(L) = h_\mu . \tag{A26} $$

Since, by Eq. (A25), the $h_\mu(L)$'s are nonincreasing as L
increases, it follows that $h_\mu(L) \ge h_\mu$.

We now prove the left inequality in the lemma.

$$ h'_\mu(L) = \frac{H(L)}{L} = \frac{1}{L} \sum_{i=1}^{L} H[S_i | S_{i-1} S_{i-2} \cdots S_1] . \tag{A27} $$

For all $i \le L$,

$$ H[S_i | S_{i-1} S_{i-2} \cdots S_1] \ge H[S_L | S_{L-1} S_{L-2} \cdots S_1] . \tag{A28} $$

Thus,

$$ h'_\mu(L) \ge \frac{1}{L} \sum_{i=1}^{L} H[S_L | S_{L-1} S_{L-2} \cdots S_1] \tag{A29} $$

$$ = \frac{1}{L} \, L \, H[S_L | S_{L-1} S_{L-2} \cdots S_1] \tag{A30} $$

$$ = h_\mu(L) . \tag{A31} $$

□



9. Lemma |^ 
Lemma |: $E'(L) \le E(L) \le E$.
Proof: We first prove the right inequality. Recall that

$$ E(L) = H(L) - L h_\mu(L) = \sum_{M=1}^{L} \left( h_\mu(M) - h_\mu(L) \right) . \tag{A32} $$

Since $M \le L$ for all terms in the summand, all elements
of the sum are non-negative. Now, the excess entropy is
defined as

$$ E \equiv \lim_{L \to \infty} \sum_{M=1}^{L} \left( h_\mu(M) - h_\mu(L) \right) . \tag{A33} $$

Thus, E(L) is the partial sum of the above term. Since
all terms in the sum are non-negative, it follows
immediately that the partial sum E(L) is no greater than the
infinite sum E.

We now prove the left inequality. Using stationarity,

$$ E'(L) = 2H(L/2) - H(L) . \tag{A34} $$

Recall that for odd L, we defined $E'(L) = E'(L-1)$. To
prove the left inequality, it will suffice to show that:

$$ 2H(L/2) - H(L) \le H(L) - L h_\mu(L) . \tag{A35} $$

Rearranging, we have:

$$ 2H(L/2) \le 2H(L) - L h_\mu(L) . \tag{A36} $$

By the concavity of H(L), $2H(L/2) \ge H(L)$, and thus
the above equation becomes:

$$ H(L) \le 2H(L) - L h_\mu(L) . \tag{A37} $$

Rearranging again, we see that we need to show:

$$ H(L) \ge L h_\mu(L) . \tag{A38} $$

That this equation is true can be seen geometrically by
inspecting Fig. 18. Note that the inequality is saturated
if and only if the process is independent identically
distributed.

To verify Eq. (A38) algebraically, we use the chain rule
on the left hand side and obtain:

$$ \sum_{M=1}^{L} H[S_M | S_{M-1} S_{M-2} \cdots S_1] \ge L h_\mu(L) . \tag{A39} $$

But,

$$ \sum_{M=1}^{L} H[S_M | S_{M-1} S_{M-2} \cdots S_1] \ge \sum_{M=1}^{L} H[S_L | S_{L-1} S_{L-2} \cdots S_1] \tag{A40} $$

$$ = L h_\mu(L) . \tag{A41} $$

Thus, Eq. (A38) is true, and the proof is complete. □



APPENDIX B: EXPONENTIAL CONVERGENCE 
TO THE ENTROPY RATE 

It was claimed in the main text that $h_\mu(L) - h_\mu$ often
vanishes exponentially fast for finitary sources. Why is
this behavior so common? There are several ways to
argue for the ubiquity of exponential entropy convergence.

First, note that if $\Delta^2 H(L)$ converges to 0
exponentially fast, then $h_\mu(L) = \Delta H(L)$ must also converge
exponentially fast. Then, a direct calculation shows that
$\Delta^2 H(L) \le I(L)$, where I(L) is the mutual information
between two variables separated by L symbols. Now,
the two-variable mutual information is related to the
two-variable correlation function C(L). In particular,
$I(L) \propto C^2(L)$. This result was first shown for binary
sequences by Li Q| and later generalized to larger
alphabets by Herzel and Grosse. As a result, if the
correlations decay exponentially, then the two-symbol mutual
information decays exponentially. This, in turn, allows
one to conclude that the entropy-rate estimate converges
exponentially and so E is finite.

The conclusion from these observations is that expo- 
nential convergence of correlation functions implies the 
exponential convergence of the entropy rate. However, 
this only transfers the convergence question from entropy 
rates to correlation functions. So why is it that correla- 
tion functions typically decay exponentially? There are 
several answers to this question. 

Mathematically, many stochastic processes can be
re-expressed as one-dimensional spin models; see, e.g.,
Ref. I?^. Thus, we expect that what is typical for
spin systems will also be typical for the more general
stochastic processes of interest to us here. In a
one-dimensional statistical mechanical model with finite
interaction strengths, one can always express the
partition function as an infinite product of transfer
matrices. The correlation function between two spins L lattice
sites apart is proportional to $(\lambda_1/\lambda_0)^L$, where $\lambda_0$ is the
largest eigenvalue of the transfer matrix and $\lambda_1$ the
second largest eigenvalue. The Perron-Frobenius theorem
guarantees that the largest eigenvalue is nondegenerate,
so that $|\lambda_1/\lambda_0| < 1$, thus establishing the exponential
decay of the correlation function. This result is standard;
see, for example, Ref. f^.
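
The eigenvalue-ratio statement can be illustrated with the simplest possible case. The following is a minimal sketch, assuming a two-state Markov chain whose row-stochastic matrix `P` stands in for a normalized transfer matrix (so that $\lambda_0 = 1$). The transition probabilities `a` and `b`, the choice of the state label as the observable, and the helper names are all assumptions made for this illustration.

```python
# Hypothetical two-state Markov chain with row-stochastic matrix P.
a, b = 0.1, 0.4                      # Pr(0 -> 1) and Pr(1 -> 0), illustrative values
P = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]      # stationary distribution of P

# Eigenvalues of a 2x2 stochastic matrix are 1 and (trace - 1).
lam0 = 1.0
lam1 = P[0][0] + P[1][1] - 1.0

def matmul(X, Y):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def correlation(L):
    """C(L) = <x_0 x_L> - <x>^2 for the observable x = state label (0 or 1)."""
    PL = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(L):
        PL = matmul(PL, P)
    mean = pi[1]
    return pi[1] * PL[1][1] - mean ** 2

# Successive ratios C(L+1)/C(L) should equal lam1/lam0 for every L.
ratios = [correlation(L + 1) / correlation(L) for L in range(1, 10)]
print(ratios[0], lam1 / lam0)
```

For a two-state chain the decay is exactly geometric; with more states the ratio approaches $\lambda_1/\lambda_0$ only asymptotically.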

Physically, in a spin system the sum of the correlation
functions yields the magnetic susceptibility $\chi$. The
exponential decay of the correlation function thus ensures
that $\chi$ is finite. Hence, away from a critical point, where
we expect finite response functions such as $\chi$, we also
expect exponentially decaying correlation functions, or
at least correlations that decay faster than 1/L.

Mathematically, it has been shown that, under a fairly 
wide range of circumstances, a statistical mechanical sys- 
tem with an analytic partition function necessarily has 
correlation functions that decay exponentially [ pO| .
Unlike the Perron-Frobenius transfer matrix argument,
these results hold for systems in more than one
dimension.



APPENDIX C: PROOF OF THE
SYNCHRONIZATION INFORMATION
THEOREM

We begin by restating the theorem:

Theorem 1 If the source is order-R Markovian, then

$$ \mathbf{S} = \mathbf{T} + \frac{1}{2} R(R+1) h_\mu . \tag{C1} $$

Proof: Since the transition probabilities are normalized,
the transition matrix $T$ is a stochastic matrix;
$\sum_B T_{AB} = 1$. The eigenvector
corresponding to the eigenvalue 1 shall be denoted by $\pi$
and is normalized in probability;

$$ \sum_{A \in \mathcal{V}} \pi_A = 1 . \tag{C2} $$

As is well-known, $\pi_A$ gives the asymptotic probability of
the state $A \in \mathcal{V}$. Equivalently, in terms of the R-blocks,

$$ \pi_A = \Pr(\phi^{-1}(A)) . \tag{C3} $$

Or, simply

$$ \pi_A = \Pr(s^R) , \tag{C4} $$

where $s^R$ is understood to correspond to the $A$th state.

Initially, before any measurements are made, we
assume our distribution over $\mathcal{V}$ is given by $\pi$;

$$ \Pr(\mathcal{V} | \lambda, M) = \pi , \tag{C5} $$

where $\lambda$ is the empty string. Hence, $\mathcal{H}(0) = H[\pi]$.
Equivalently, it follows from Eqs. (76) and (C2) that

$$ \mathcal{H}(0) = H(R) . \tag{C6} $$

If we observe a particular symbol $s'_1$, we now know that
the process must be in one of the states that correspond
to symbol blocks whose first symbol is $s'_1$. We denote
this set of states by:

$$ \mathcal{V}_{s'_1} \equiv \{ \phi(s'_1 s_2 \cdots s_R) : s_i \in \mathcal{A}, \ 2 \le i \le R \} . \tag{C7} $$

Likewise, after we've observed the particular length-L
sequence $s'^L$, $L \le R$, we know that the process must be in
one of the states that corresponds to an R-symbol block
whose first L symbols are $s'^L$;

$$ \mathcal{V}_{s'^L} = \{ \phi(s'^L s_{L+1} s_{L+2} \cdots s_R) : s_i \in \mathcal{A}, \ L+1 \le i \le R \} , \quad L \le R . \tag{C8} $$

The following properties of $\mathcal{V}_{s^L}$ follow immediately
from the definition, Eq. (C8):

$$ \mathcal{V}_{s^L} \subset \mathcal{V} , \tag{C9} $$

$$ \mathcal{V}_{s^L} \cap \mathcal{V}_{s'^L} = \emptyset \ \text{if and only if} \ s^L \ne s'^L , \tag{C10} $$

and,

$$ \bigcup_{s^L} \mathcal{V}_{s^L} = \mathcal{V} . \tag{C11} $$

Thus, the set of L-blocks $\{s^L\}$ induces a partition of the
set of states $\mathcal{V}$. For a given L there are at most $|\mathcal{A}|^L$
sets $\mathcal{V}_{s^L}$, each of which is a proper subset of $\mathcal{V}$. (There
are exactly $|\mathcal{A}|^L$ subsets of $\mathcal{V}$ if and only if there are no
forbidden sequences.) The set $\mathcal{V}_{s^L}$ has at most $|\mathcal{A}|^{R-L}$
elements. So, as more and more symbols are observed,
i.e., as L grows, the subsets $\mathcal{V}_{s^L}$ of $\mathcal{V}$ become more and
more refined. For the Markovian case considered here,
eventually enough symbols will be observed so that we
know with probability 1 the state of the process. Since
the Markovian states are in a one-to-one relation with
the R-blocks, we are guaranteed to know the state with
certainty after R symbols have been observed. Hence,
$\mathcal{H}(R) = 0$. Observing subsequent symbols will not add
to the state uncertainty since each observation uniquely
determines the subsequent state. Thus, $\mathcal{H}(L) = 0$ for
$L \ge R$.

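For $L \le R$, the distribution over the Markovian states
$v \in \mathcal{V}$ is given by:

$$ \Pr(v | s^L, M) = \frac{(\pi^{s^L})_v}{\Pr(s^L)} , \tag{C12} $$

where $\pi^{s^L}$ is a vector whose $|\mathcal{V}|$ components are given by:

$$ (\pi^{s^L})_v = \begin{cases} \pi_v , & v \in \mathcal{V}_{s^L} \\ 0 , & \text{otherwise} \end{cases} \tag{C13} $$

We are interested in calculating $\mathcal{H}(L)$, the average
state-uncertainty after observing L symbols. In order to
perform this calculation, the following two properties of $\pi^{s^L}$
will be necessary.

First, for fixed $s^L$, observe that summing $(\pi^{s^L})_v$ over
its components v results in $\Pr(s^L)$, the probability of that
particular $s^L$. This follows from the definition of $(\pi^{s^L})_v$,
Eq. (C13):

$$ \sum_{v \in \mathcal{V}} (\pi^{s^L})_v = \sum_{v \in \mathcal{V}_{s^L}} \pi_v \tag{C14} $$

$$ = \sum_{v \in \mathcal{V}_{s^L}} \Pr(\phi^{-1}(v)) \tag{C15} $$

$$ = \sum_{s_{L+1}, \ldots, s_R} \Pr(s^L s_{L+1} \cdots s_R) \tag{C16} $$

$$ = \Pr(s^L) . \tag{C17} $$

Hence, $\Pr(v | s^L, M)$ as given in Eq. (C12) is normalized
for each $s^L$.

Second, notice that $(\pi^{s^L})_v$ has only one nonzero entry
for fixed L and fixed state v. This follows from
noting that the particular state $v \in \mathcal{V}$ is associated with a
particular R-block $\phi^{-1}(v)$. More formally, suppose that
$(\pi^{s^L})_v$ has a nonzero entry for two different L-blocks, say
$s^L$ and $s'^L$;

$$ (\pi^{s^L})_v \ne 0 \ \text{and} \ (\pi^{s'^L})_v \ne 0 . \tag{C18} $$

Then, by Eq. (C13), it follows that:

$$ v \in \mathcal{V}_{s^L} , \ \text{and} \ v \in \mathcal{V}_{s'^L} , \tag{C19} $$

which, in turn, implies that:

$$ \mathcal{V}_{s^L} \cap \mathcal{V}_{s'^L} \ne \emptyset \ \text{and} \ s^L \ne s'^L . \tag{C20} $$

This last equation contradicts Eq. (C10). Thus, the
original proposition must be true: $(\pi^{s^L})_v$ has only one
nonzero entry, namely $\pi_v$, for all possible $s^L$'s.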

We are now ready to complete our calculation of $\mathcal{H}(L)$.
Plugging Eq. (C12) into Eq. ( fz^ ) and simplifying slightly,
we have:

$$ \mathcal{H}(L) = -\sum_{s^L} \sum_{v \in \mathcal{V}} (\pi^{s^L})_v \log_2 (\pi^{s^L})_v + \sum_{s^L} \sum_{v \in \mathcal{V}} (\pi^{s^L})_v \log_2 \Pr(s^L) . \tag{C21} $$

Parenthetically, we note that $\mathcal{H}(L)$ is, up to sign, an information
gain: $\mathcal{H}(L) = -\mathcal{D}[\pi^{s^L} \,\|\, \Pr(s^L)]$. By Eq. (C17), we can
perform the sum over v in the second term on the right-hand
side of the above equation, and we obtain the negative of the
entropy of an L-block, $-H(L)$.

To evaluate the first term on the right-hand side, recall
that $(\pi^{s^L})_v$ has only one nonzero entry for fixed L and
fixed v. Using this, we see that

$$ \sum_{s^L} \left( -\sum_{v \in \mathcal{V}} (\pi^{s^L})_v \log_2 (\pi^{s^L})_v \right) = -\sum_{v \in \mathcal{V}} \pi_v \log_2 \pi_v \tag{C22} $$

$$ = H[\pi] \tag{C23} $$

$$ = H(R) . \tag{C24} $$

Thus, it follows that

$$ \mathcal{H}(L) = \begin{cases} H(R) - H(L) , & \text{if } 0 \le L < R \\ 0 , & \text{if } L \ge R \end{cases} \tag{C25} $$

We now have an expression for $\mathcal{H}(L)$ in terms of H(L),
and we finish the proof with a direct calculation. Looking
at Eq. (C1), one sees that it will suffice to show that:

$$ \sum_{L=0}^{\infty} \mathcal{H}(L) - \mathbf{T} = \frac{1}{2} R(R+1) h_\mu . \tag{C26} $$

By assumption, the process is order-R Markovian. This
implies that $\mathcal{H}(L) = 0$ and $H(L) = E + h_\mu L$ for all
$L \ge R$. As a result of this latter equation, the summand
of the infinite sum that defines T, Eq. (63), is zero for
all $L \ge R$. That is, the last nonzero contribution to the
sum comes at $L = R - 1$. As a result, the left-hand side
of Eq. (C26) can be written as:

$$ \sum_{L=0}^{\infty} \mathcal{H}(L) - \mathbf{T} = \sum_{L=0}^{R-1} [H(R) - H(L) - E - h_\mu L + H(L)] \tag{C27} $$

$$ = \sum_{L=0}^{R-1} [H(R) - E - h_\mu L] . \tag{C28} $$

But $H(R) = E + h_\mu R$, since we assume that
synchronization occurs at L = R. Plugging this into the above
equation, we have:

$$ \sum_{L=0}^{\infty} \mathcal{H}(L) - \mathbf{T} = \sum_{L=0}^{R-1} [E + h_\mu R - E - h_\mu L] \tag{C29} $$

$$ = h_\mu \sum_{L=0}^{R-1} (R - L) \tag{C30} $$

$$ = h_\mu \left[ R^2 - \frac{1}{2} R(R-1) \right] \tag{C31} $$

$$ = \frac{1}{2} R(R+1) h_\mu . \tag{C32} $$

This last equation is Eq. (C26), thus completing the
proof. □
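
The statement of the theorem, Eq. (C1), can be checked numerically for a small example. The following is a minimal sketch, assuming a hypothetical order-2 binary Markov source whose conditional probabilities `COND` are arbitrary illustrative values; the helper names (`stationary`, `block_probs`, `state_uncertainty`) are introduced here and are not from the text. The state uncertainty is computed directly from the posterior over R-blocks, following the construction of Eq. (C8) in which the observed word fixes the first L symbols of the block, and E is read off from block entropies beyond L = R rather than from H(R) itself.

```python
import itertools
import math

R = 2                                   # Markov order (illustrative)
ALPHABET = (0, 1)
# Hypothetical order-2 source: Pr(next symbol | previous two symbols).
COND = {
    (0, 0): (0.3, 0.7),
    (0, 1): (0.6, 0.4),
    (1, 0): (0.2, 0.8),
    (1, 1): (0.5, 0.5),
}
BLOCKS = list(COND)                     # the R-blocks are the Markovian states

def stationary(iterations=2000):
    """Stationary distribution over R-blocks by power iteration."""
    pi = {b: 1 / len(BLOCKS) for b in BLOCKS}
    for _ in range(iterations):
        new = dict.fromkeys(BLOCKS, 0.0)
        for b, p in pi.items():
            for s in ALPHABET:
                new[b[1:] + (s,)] += p * COND[b][s]
        pi = new
    return pi

PI = stationary()

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def block_probs(L):
    """Pr(s^L): stationary R-block probability times conditional extensions."""
    n = max(L, R)
    out = {}
    for word in itertools.product(ALPHABET, repeat=n):
        p = PI[word[:R]]
        for t in range(R, n):
            p *= COND[word[t - R:t]][word[t]]
        out[word[:L]] = out.get(word[:L], 0.0) + p
    return out

H = [entropy(block_probs(L).values()) for L in range(R + 3)]
h_mu = sum(PI[b] * entropy(COND[b]) for b in BLOCKS)
E = H[R + 2] - (R + 2) * h_mu           # asymptote intercept, read off beyond L = R
T = sum(E + h_mu * L - H[L] for L in range(R))

def state_uncertainty(L):
    """Average entropy of Pr(v | s^L), where s^L fixes the first L symbols of block v."""
    total = 0.0
    for word, p_word in block_probs(L).items():
        posterior = [PI[b] / p_word for b in BLOCKS if b[:L] == word]
        total += p_word * entropy(posterior)
    return total

S = sum(state_uncertainty(L) for L in range(R + 1))
rhs = T + 0.5 * R * (R + 1) * h_mu
print(f"S = {S:.6f},  T + R(R+1)h_mu/2 = {rhs:.6f}")
```

The two printed numbers agree to rounding error in this sketch; changing `COND` to any other strictly positive conditional probabilities should preserve the agreement, since only the order-R Markov property is used.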